Excerpt from DRAFT POSIX Std. 1003.25

Disclaimer:  This section from the Draft POSIX Standard is provided for reference only.  This is the current revision of the draft as of September 20, 2001. This is not guaranteed to be the latest revision.

If you have any comments or questions, please send e-mail to lkessler@users.sourceforge.net.

 

Annex B (informative): Rationale and Notes

B.20    Event Logging

B.20.1     Introduction

B.20.2     Logging of Kernel Events

B.20.3     Event Log Structure; Persistence of Records

B.20.4     Remote Logs; Portable Logs

B.20.5     Integrity of Event Data

B.20.6     Log Maintenance

B.20.7     Performance

B.20.8     Why Not Just syslog?

B.20.9     Data Definitions

B.20.10   Data Formats

B.20.11   Log-Entry Object

B.20.12   Write to the Log

B.20.13   Open an Event Log for Read Access

B.20.14   Read from an Event Log

B.20.15   Notify Process of Availability of System Log Data

B.20.16   Reposition the Read Pointer

B.20.17   Compare Event Record Severities

B.20.18   Queries = Event Filters

B.20.19   String Equivalents of Event Attributes

B.20.20   Standard Event Types

B.20.21   Revision History

 


Annex B (informative): Rationale and Notes

B.20    Event Logging

B.20.1     Introduction

The standard calls for a single system-wide log, to which all event records are written. Funneling all event records intact through a single logical stream makes it easier for an implementation to monitor and analyze events in a system-wide context, in order to determine where faults may exist. This capability is critical to the consensus model of using the event stream as a conduit for achieving fault-tolerance within the system.

The standard provides for logging of raw binary data, to enable efficient construction and processing of log event records. With purely textual data, hardware sense data and other binary failure data cannot be adequately supported, and analysis options are limited.  The standard also supports the logging of textual data as a special case, and so in this respect is compatible with the functionality commonly found in syslog implementations.

B.20.2     Logging of Kernel Events

The POSIX standard specifies an application’s interface to the operating system.  Therefore, this event logging standard does not attempt to specify an API for logging events generated by the operating system kernel.  There was strong consensus, however, that the event logging system should accommodate events generated by the kernel, that kernel events should be logged to the system log along with application-generated events, and that the format of kernel-generated events should conform to this event logging standard.

B.20.3     Event Log Structure; Persistence of Records

The structure and organization of event logs is left to the implementer's discretion, subject only to the constraints of the specified interfaces.  In particular, although the standard requires that the posix_log_read() function yield a posix_log_entry structure that contains the event record’s attributes, there is no requirement that the attributes be stored in the log in that form.

The standard establishes an open/read/close style of interface for reading of event records, in order to support processing of archival log files and log files from other systems.  This same interface is used to read the system log.

Applications that read the system log should be prepared for new events to be appended to the log at any time.  Other than that, it was felt that, at least between maintenance activities (see Section 20.5), such applications should have a stable view of the event log: once read, event records should not disappear from the application’s view of the log.

B.20.3.1     Order of Events in Log

The standard requires that new events be added at the end of the system log.  This implies that a sequential read of the log will yield the records in chronological order according to when they were written to the log.  However, a record’s timestamp is assigned at the time of the call to posix_log_write().  (It was felt that the timestamp should reflect the time of the event as nearly as possible.)  The working group recognized that, for a variety of reasons, it may be difficult for some implementations to guarantee that events are written to the log in exactly the order in which they were initiated – especially considering that kernel events may circumvent posix_log_write() entirely on their way to the log.  Therefore, there is no requirement that events appear in the log in order of timestamp.
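A consequence for readers of the log is that a program needing strict timestamp order has to establish that order itself.  The following is a minimal sketch of that idea, not an interface from the standard: struct demo_entry, its two members, and the demo_ functions are hypothetical stand-ins, and the timestamp is simplified to a time_t.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical, simplified view of an event record's attributes. */
struct demo_entry {
    unsigned long recid;   /* ascending in log (write) order */
    time_t        time;    /* assigned at the time of the write call */
};

/* Order by timestamp; fall back to record ID so ties keep log order. */
static int demo_cmp_time(const void *pa, const void *pb)
{
    const struct demo_entry *a = pa, *b = pb;

    if (a->time != b->time)
        return (a->time > b->time) - (a->time < b->time);
    return (a->recid > b->recid) - (a->recid < b->recid);
}

/* Returns 1 if the records are already in non-decreasing timestamp order. */
int demo_is_time_ordered(const struct demo_entry *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (v[i].time < v[i - 1].time)
            return 0;
    return 1;
}

/* Re-orders records that were read sequentially from a log. */
void demo_sort_by_time(struct demo_entry *v, size_t n)
{
    qsort(v, n, sizeof v[0], demo_cmp_time);
}
```

A reader could call demo_is_time_ordered() first and sort only when the check fails, since most logs are expected to be nearly in timestamp order already.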

The implementation is free to support other methods of accessing event logs, so long as the aforementioned sequential read is supported according to the standard.

B.20.4     Remote Logs; Portable Logs

The standard does not provide for direct access to logs on remote systems, nor does it specify how event information is transferred from one system to another. The working group expects that implementers will provide extensions to the standard to meet this need.  The format of an event record is binary rather than textual, so portability (if any) of an event log between different architectures is left to the implementation(s). 

The working group briefly considered the idea of defining an architecture-independent event-log format.  This would enable a conforming program running on one system to display or otherwise interrogate an event log written on another system – even one with a completely different architecture.  However, several issues became immediately apparent.  For example,

The working group concluded that portability of event logs is beyond the scope of the current standard.

B.20.5     Integrity of Event Data

There was much discussion of possible guarantees that should be required of the implementation regarding the availability of log data once a call to posix_log_write() is initiated.  The working group envisioned the following timeline for the lifetime of an event that is logged via the posix_log_write() function.  The timeline would be essentially the same for events logged by the kernel.  As discussed later, some of the steps in this timeline may not occur in the indicated order, if at all.

  1. An application calls posix_log_write() – directly, through posix_log_printf(), or through an application- or implementation-defined interface.
  2. The implementation decides whether the caller has permission to log the indicated event record.   If not, the call to posix_log_write() fails with EPERM.  The implementation may do additional screening at this point.  (For example, the implementation may be configured to reject all events with a severity of LOG_DEBUG.)  If the implementation decides to reject the event record on such grounds at this point, the call to posix_log_write() fails with ECANCELED.
  3. The implementation captures the event record in temporary storage – for example, in a memory buffer in the kernel or in an event-logging daemon.
  4. The implementation may perform some sort of implementation-defined notification at this point – e.g., if immediate notification is crucial.  (The standard is silent regarding this step, since it was felt that real-time notification about urgent events is not within the scope of event logging.  Except as otherwise noted, the term “notification” refers to a notification that is sent in response to a notification request that was registered via the posix_log_notify_add() function.  See step 8.)
  5. The posix_log_write() call returns zero (success).
  6. The implementation may further screen the event record at this point – for example, to screen out LOG_DEBUG records or to eliminate duplicate event records.  The record may be discarded at this point even though the call to posix_log_write() has succeeded. 
  7. The implementation writes the event record to long-term storage, such as a disk file.
  8. The implementation sends out notifications to processes that have registered via posix_log_notify_add() to be notified when this type of event is logged.
  9. The event record resides in the system log until a log-maintenance activity removes it.
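The split between the quick checks made before the write call returns (steps 2 through 5) and the deferred work done afterward (steps 6 and 7) can be sketched as a toy model.  Everything named demo_ below is hypothetical, the duplicate-elimination rule is only one possible form of deferred screening, and the sketch assumes that the write call reports failure by returning the error number, which this rationale does not state.

```c
#include <errno.h>
#include <stddef.h>
#include <syslog.h>               /* LOG_DEBUG and the other severities */

#define DEMO_BUF_SLOTS 4          /* hypothetical temporary-storage capacity */

struct demo_event {
    int severity;                 /* syslog-style severity */
    int event_type;               /* facility-specific event code */
};

static struct demo_event demo_buf[DEMO_BUF_SLOTS];
static size_t demo_buf_used;

/* Steps 2, 3 and 5: quick checks, capture in temporary storage, return. */
int demo_log_write(int caller_may_log, int severity, int event_type)
{
    if (!caller_may_log)
        return EPERM;             /* step 2: permission check */
    if (severity == LOG_DEBUG)
        return ECANCELED;         /* step 2: optional quick screening */

    if (demo_buf_used < DEMO_BUF_SLOTS) {   /* step 3: capture */
        demo_buf[demo_buf_used].severity = severity;
        demo_buf[demo_buf_used].event_type = event_type;
        demo_buf_used++;
    }
    /* else: temporary storage overflowed and the record is lost */

    return 0;                     /* step 5: success is reported either way */
}

/* Steps 6 and 7: deferred screening, then the write to long-term storage.
 * Returns the number of records that would reach the log. */
size_t demo_flush(void)
{
    size_t kept = 0;

    for (size_t i = 0; i < demo_buf_used; i++) {
        /* step 6: drop a record identical to its predecessor */
        if (i > 0 &&
            demo_buf[i].severity == demo_buf[i - 1].severity &&
            demo_buf[i].event_type == demo_buf[i - 1].event_type)
            continue;
        kept++;                   /* step 7 would write the record here */
    }
    demo_buf_used = 0;
    return kept;
}
```

Note that demo_log_write() returns zero even when the buffer is full or when demo_flush() later discards the record, which mirrors the point that a successful write does not guarantee a readable record.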

Regarding this timeline, the standard states:

Beyond this, any guarantees about the integrity and/or availability of log data are up to the implementation.

B.20.5.1     Successful Write May Not Imply Successful Read

It was felt that posix_log_write() should complete as quickly as possible, to minimize event-logging overhead in the calling process.  Therefore, it was felt that completion of posix_log_write() should not have to wait for completion of the write to long-term storage (step 7) or delivery of associated notifications (step 8).  It was felt that any screening done by posix_log_write() (step 2) should be very quick.  Additional screening (step 6) could be deferred until after completion of posix_log_write().

Note that even if posix_log_write() succeeds, the associated event record may never become available for reading.  Here’s why:

a.  The implementation is permitted to drop the record at step 6.

b. The event record may be lost from temporary storage before it is written to long-term storage – for example, if the incoming event rate is so high that temporary storage overflows.

c.  The write to long-term storage may fail – for example, if there is no more room in the log file’s filesystem.

d.  A log maintenance activity may delete the record from the log before the record is ever read.

Items (b) and (c) above reflect the inevitability of limitations on memory and disk space, respectively.  (It was felt, however, that once (c) is detected, the implementation should return ENOSPC on subsequent posix_log_write() calls until space is once again made available.)
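The suggested ENOSPC behavior amounts to a sticky condition that is set when a write to long-term storage fails and cleared when space returns.  A minimal sketch follows, assuming a single flag is enough to model the condition; all demo_ names are hypothetical and how an implementation detects freed space is outside the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Set once a write to long-term storage has failed for lack of space. */
static bool demo_log_full;

/* Called by the (hypothetical) storage layer when item (c) is detected. */
void demo_storage_write_failed(void)
{
    demo_log_full = true;
}

/* Called when log maintenance or the administrator frees space. */
void demo_space_available(void)
{
    demo_log_full = false;
}

/* Early check a write call could make before accepting a record. */
int demo_write_precheck(void)
{
    return demo_log_full ? ENOSPC : 0;
}
```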

Items (a) and (d) were more controversial.  They go against the generally accepted philosophy of “log everything and sort it out later.”  Item (a) is also a calculated breach of the implied promise that a successfully written record can be subsequently read.  On the other hand, such filtering may minimize the (less predictable) loss of data due to items (b) and (c).  In any case, it was felt that the filtering implied by items (a) and (d) should be well documented, and configurable by the log administrator.

All of the above notwithstanding, the implementation is free to delay completion of posix_log_write() until after the event has been written to long-term storage and/or notifications have been sent.  (But the performance impact should be well understood.)

B.20.5.2     Assignment of Record ID

The standard requires that, within an event log, records have ascending record IDs.  In particular, kernel events are not allowed to have a different record-ID sequence from that of application-generated events in the same log.

The sequence of record IDs is not required to be free of gaps.  Therefore, the implementation is free to assign a record ID to a record that may later be dropped or deleted.  It was felt that the range of possible record IDs is huge enough to allow a certain amount of extravagance in this area.

The record ID must be assigned before notifications associated with this event are delivered (since the notification may include the record ID, and also because record ID is a query criterion), and before the record is first sought (via posix_log_seek()) or read.
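The ordering rule can be stated as a simple check over the record IDs seen during a sequential read: each ID must be greater than the one before it, and the size of the step does not matter.  The sketch below assumes, for illustration only, that a record ID fits in an unsigned long; demo_recids_ascending() is a hypothetical helper.

```c
#include <stddef.h>

/* Returns 1 if the IDs are strictly ascending.  Gaps are acceptable;
 * repeats and decreases are not. */
int demo_recids_ascending(const unsigned long *ids, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (ids[i] <= ids[i - 1])
            return 0;
    return 1;
}
```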

B.20.5.3     Logging Events at System Shutdown

Events that occur at the time of an unexpected system shutdown are among the most interesting and useful to log.  Although the standard is silent on this subject, the working group felt that events logged to temporary storage (but not to long-term storage) just before a system shutdown should be written, if possible, to the system log when the system is rebooted.

B.20.6     Log Maintenance

The standard does not specify how or when the system log is to be archived, compacted, or otherwise modified for maintenance purposes.  However, it was recognized that such activities are commonplace, and it was felt that applications should not have to abandon access to the system log when these activities occur.  Section 20.5 describes a mechanism for notifying applications when such a maintenance activity starts and ends, so that the applications can resume access to the log once the activity is completed.

The system log is “always available to accept event records.”  It was felt that the implementation should not reject calls to posix_log_write() simply because they happen to occur during log maintenance activities.  The implementation should either buffer these new records or suspend completion of posix_log_write() calls until maintenance is complete.

B.20.7     Performance

Performance of the event logging system was a concern primarily in the following areas:

As discussed in Section B.20.15, timely (as opposed to efficient) delivery of notifications was not deemed to be a vital performance consideration.

B.20.8     Why Not Just syslog?

The syslog event-logging system, used with many UNIX systems, was considered as a basis for POSIX event logging.  For the most part it was rejected, for the following reasons:

Other event-logging implementations exist that have overcome these shortcomings.  Some of these implementations support the syslog() function for backward compatibility, but feed the syslog()-generated messages into a more flexible logging system.

In view of the widespread use of syslog, it was considered important to be able to implement syslog’s primary features using the POSIX event-logging interface.  For example, the posix_log_printf() function supports syslog()’s printf-like formatting capability for text-based event records. The posix_log_memtostr() function enables a simple, strictly conforming program to produce a textual version of a POSIX event log.  Functions like posix_log_seek() and posix_log_query_match() make it relatively easy to classify events and/or focus only on events of interest.  The posix_log_notify_add() feature enables the creation of programs that watch for new events and take appropriate actions as they occur.

The standard’s set of severity levels is taken directly from syslog, and compatibility with syslog here was felt to be important.  It was generally felt that eight levels of severity should be enough for anybody, although the implementation is free to define additional severities.
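For reference, the eight syslog severities are the LOG_ constants declared in <syslog.h>.  The sketch below tabulates them and compares two of them; it assumes the traditional numbering in which a smaller value means a more severe event, which is common practice rather than something this rationale states.  The demo_ names are hypothetical and are not the standard's own conversion or comparison functions.

```c
#include <stddef.h>
#include <syslog.h>

struct demo_sev_name {
    int         code;
    const char *name;
};

/* The eight severities inherited from syslog, most severe first. */
static const struct demo_sev_name demo_severities[] = {
    { LOG_EMERG,   "LOG_EMERG"   },
    { LOG_ALERT,   "LOG_ALERT"   },
    { LOG_CRIT,    "LOG_CRIT"    },
    { LOG_ERR,     "LOG_ERR"     },
    { LOG_WARNING, "LOG_WARNING" },
    { LOG_NOTICE,  "LOG_NOTICE"  },
    { LOG_INFO,    "LOG_INFO"    },
    { LOG_DEBUG,   "LOG_DEBUG"   },
};

#define DEMO_NSEV (sizeof demo_severities / sizeof demo_severities[0])

/* Returns the constant's name, or NULL for a value outside the eight. */
const char *demo_severity_name(int severity)
{
    for (size_t i = 0; i < DEMO_NSEV; i++)
        if (demo_severities[i].code == severity)
            return demo_severities[i].name;
    return NULL;
}

/* Assumes the traditional numbering: smaller value, more severe. */
int demo_more_severe(int a, int b)
{
    return a < b;
}
```

An implementation that defines additional severities could not rely on this numeric comparison, which is one reason to keep the comparison behind a function.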

The standard also accepts syslog’s set of facilities, again largely for compatibility with syslog.  (While different implementations of syslog tend to use the same set of severity levels, there is much less agreement on the set of facilities.  The set specified in the standard is intended to be a common subset.)   It is fully expected that implementations and/or applications will define additional facilities.  The set inherited from syslog was not felt to be adequate, but there was general consensus as to the futility of trying to define a complete set that would be widely accepted.

B.20.9     Data Definitions

This section of the rationale discusses Section 20.2 of the normative text, with the following exceptions:

B.20.9.1     Facility Codes

The data type posix_log_facility_t is an opaque type that is not an array type.  (Array types were disallowed for this and some other types because of the complexities associated with passing arrays as function arguments in C.)  Existing implementations typically use either character strings or integers as facility codes.  The posix_log_factostr() and posix_log_strtofac() functions were included in recognition of the fact that a facility’s code may not be the same as its name.

Integer codes have the advantage of compactness and simplicity, but have the disadvantage that different systems may assign different numbers to the same facility.  For example, the Volume Manager may be facility number 55 on system A, but number 59 on system B.  This could create problems when analyzing system A’s event log on system B.  One approach to this problem is to make the facility’s code a function of its name – for example, a hash code or checksum.  It was felt that the solution to this problem is implementation- or installation-dependent, and is therefore beyond the scope of the standard.
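As an illustration of deriving a facility's code from its name, the sketch below uses the 32-bit FNV-1a hash.  The choice of FNV-1a is an assumption made for the example and demo_facility_code() is a hypothetical helper; the rationale names only "a hash code or checksum" and leaves the actual solution to the implementation or installation.

```c
#include <stdint.h>

/* Derives a 32-bit facility code from the facility's name (FNV-1a). */
uint32_t demo_facility_code(const char *name)
{
    uint32_t h = 2166136261u;                 /* FNV-1a 32-bit offset basis */

    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++) {
        h ^= *p;                              /* mix in the next byte */
        h *= 16777619u;                       /* FNV-1a 32-bit prime */
    }
    return h;
}
```

With such a scheme, system A and system B compute the same code for "Volume Manager" without coordination.  Two different names can still hash to the same code, so an installation would need some way to detect or resolve collisions.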

Note that although posix_log_facility_t cannot be an array type, it can be a struct whose only member is an array – for example:

 

typedef struct {
      char fac_name[20];
} posix_log_facility_t;
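The practical benefit of the struct wrapper is that the facility code can be returned, assigned, and passed by value like any scalar, which a bare array cannot.  The sketch below repeats the example definition above purely for illustration; the demo_ helpers are hypothetical, and a real implementation may define posix_log_facility_t quite differently.

```c
#include <string.h>

/* Example layout from the text above; not a required definition. */
typedef struct {
    char fac_name[20];
} posix_log_facility_t;

/* Builds a facility code from a name, truncating long names. */
posix_log_facility_t demo_make_facility(const char *name)
{
    posix_log_facility_t fac;

    memset(&fac, 0, sizeof fac);
    /* Copy at most 19 bytes so the final byte stays a terminator. */
    strncpy(fac.fac_name, name, sizeof fac.fac_name - 1);
    return fac;                       /* returned by value */
}

/* Both arguments are passed by value; no array-to-pointer decay occurs. */
int demo_facility_equal(posix_log_facility_t a, posix_log_facility_t b)
{
    return strncmp(a.fac_name, b.fac_name, sizeof a.fac_name) == 0;
}
```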

B.20.10   Data Formats

The standard is silent about the format of the “variable-data” portion of an event record, with one exception: the POSIX_LOG_STRING format code is provided for the common case where the variable-data portion is a character string.  A variety of other formats were discussed, but none were considered suitable for inclusion in the standard.

B.20.11   Log-Entry Object

The posix_log_entry struct contains those attributes that are included in every event record. There was much discussion as to what sorts of attributes should be included in this standard set.  The attributes that were chosen meet most or all of the following criteria:

B.20.11.1  Standard Attributes

The record ID (log_recid member) is intended to uniquely identify a particular instance of a particular kind of event.  It can be used to locate the event record within the log (e.g., using posix_log_seek()), or to indicate the record’s order in the log relative to other records.

The size and format attributes (log_size and log_format) specify the size and format of the variable-length data portion.

The facility and event type (log_facility and log_event_type) are intended to uniquely identify a type of event.  Different facilities can use the same event-type code for completely different types of events.  The event type can have a variety of uses:

The user ID (log_uid), process ID (log_pid), and time stamp (log_time) were widely viewed as useful or even essential.

The group ID (log_gid), process group ID (log_pgrp), thread ID (log_thread), and processor ID (log_processor) were viewed more critically; but it was felt that these attributes could be very helpful in diagnosing certain types of problems.  In any case, they are compact and easy for the implementation to capture.

The log_processor member was originally of type int, and was called log_cpu.  However, with the increasing variety of multiprocessor and/or clustered architectures, and the increasing tendency to partition processor pools and other resources into multiple virtual systems, it was felt that the implementation should be free to express a processor's ID as something other than an integer.  For similar reasons, some objected to the term “cpu” as shorthand for “processor.”  (For many multiprocessor systems, no one processor is “central.”)

The log_flags member was introduced to support the POSIX_LOG_TRUNCATE flag, and to accommodate other implementation- or application-defined flags.  Other flags were considered for inclusion in the standard, but rejected or moved to other parts of the event record.  (For example, POSIX_LOG_STRING started out as an event type, was later made a flag, and was later made a format when the log_format member was introduced.)

B.20.11.2  Non-standard Attributes

Attributes that were considered for inclusion in the log-entry structure, but rejected, include:

· caller’s source file name and line number (rejected because of space considerations, and because this information can typically be inferred from other information, such as the facility and event type)

· software version number, or similar information about the facility (rejected because this information can typically be inferred from the facility code and time stamp, given a log of software-installation and -deinstallation events)

· host ID.  This was rejected because a POSIX event log accumulates events only for the current system, and so this value would be constant throughout the event log.  It was felt by some that such a field might be useful when merging event logs from multiple related systems.  However, this anticipates a particular implementation for merging of event logs; and does not address other issues such as duplicate record IDs in the merged log, and inconsistency among facility codes, user IDs, and so on.  In general, it was felt that merging of event logs from multiple systems is beyond the scope of this standard.  It was also observed that the definition of what constitutes a “host” or “system” is becoming increasingly slippery.

The implementation is free to add attributes to the log-entry structure.  It is expected that the implementation would support such attributes in the posix_log_memtostr() function, and would permit their use in query expressions.

The implementation may also allow the packaging of non-standard attributes in the variable portion of the event record.  Depending on the implementation, such attributes might still be permitted in query expressions.  The standard is silent on this subject.

B.20.12   Write to the Log

There were two schools of thought on the handling of a call to posix_log_write() where the length of the variable data exceeds {POSIX_LOG_ENTRY_MAXLEN}.  Some thought that the call should fail, to avoid logging corrupted (truncated) data.  Some thought that the call should succeed with truncated data, to avoid losing the entire record.  The standard’s definition of the {POSIX_LOG_TRUNCATE} flag allows this issue to be decided by the implementation.  There was little support for allowing different behaviors for different applications on the same implementation.

In an event record that has a format of {POSIX_LOG_STRING} and has the {POSIX_LOG_TRUNCATE} flag set, the character string is still guaranteed to be null-terminated.  It was felt that the additional burden this placed on the implementation of posix_log_write() was minor compared to the benefit to application programs.
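For an implementation that chooses truncation, the guarantee can be met by reserving the final byte of the stored data for the terminator.  The following is a minimal sketch under stated assumptions: DEMO_ENTRY_MAXLEN and DEMO_TRUNCATE are small stand-ins for {POSIX_LOG_ENTRY_MAXLEN} and {POSIX_LOG_TRUNCATE}, demo_store_string() is hypothetical, and the stored length is assumed to include the terminating null.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_ENTRY_MAXLEN 16     /* stand-in for {POSIX_LOG_ENTRY_MAXLEN} */
#define DEMO_TRUNCATE     0x1u   /* stand-in for {POSIX_LOG_TRUNCATE} */

/* Stores msg as the variable data of a string-format record.
 * Returns the stored length, terminator included, and sets
 * DEMO_TRUNCATE in *flags when the string had to be cut. */
size_t demo_store_string(char out[DEMO_ENTRY_MAXLEN], const char *msg,
                         unsigned int *flags)
{
    size_t need = strlen(msg) + 1;            /* bytes including '\0' */

    if (need <= DEMO_ENTRY_MAXLEN) {
        memcpy(out, msg, need);               /* fits: copy as-is */
        return need;
    }

    memcpy(out, msg, DEMO_ENTRY_MAXLEN - 1);  /* keep the leading bytes */
    out[DEMO_ENTRY_MAXLEN - 1] = '\0';        /* still null-terminated */
    *flags |= DEMO_TRUNCATE;
    return DEMO_ENTRY_MAXLEN;
}
```

A reader can therefore treat the variable data of any string-format record as a C string without first checking the truncation flag.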

There is no function to write a record to the system log (or any log) by specifying a posix_log_entry struct and optional data buffer.  As a result, there is no way for a strictly conforming program to copy all or part of an event log to another log.  This capability, if desired, was viewed as a function of the underlying implementation’s log administration duties.

B.20.13   Open an Event Log for Read Access

A log descriptor (posix_logd_t) could very well incorporate a file descriptor.  It might even be a file descriptor.  There was general consensus, however, against the notion that a log descriptor has to be a file descriptor.  Hence the opaque posix_logd_t type, and a per-process limit on log descriptors ({POSIX_LOG_OPEN_MAX}) that is distinct from the per-process limit on file descriptors.

In early drafts of the standard, the posix_log_open() function included a posix_log_query_t argument.  The intent was that a sequential read of the log using the resulting log descriptor would yield only the records that match the query object.  This idea was eventually rejected in favor of the current, more flexible, version of posix_log_seek(), which could be used to provide the same effect.

B.20.14   Read from an Event Log

The working group discussed permitting a NULL value for the entry parameter of posix_log_read(), for use when the read is used only to skip to the next event record.  This was not considered particularly useful, and in any case the implementation must read at least the event’s log_size attribute in order to determine where to find the next record.
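The point about log_size can be illustrated with a toy in-memory log in which each record is a fixed header followed by log_size bytes of variable data.  This layout is purely an assumption made for the example, since the standard leaves the storage format to the implementer; struct demo_header and demo_count_records() are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical on-disk header; real layouts are implementation-defined. */
struct demo_header {
    unsigned long recid;
    size_t        log_size;   /* bytes of variable data after the header */
};

/* Walks a buffer of back-to-back records and counts the complete ones.
 * Even a pure "skip" has to read each header to learn log_size. */
size_t demo_count_records(const unsigned char *log, size_t len)
{
    size_t off = 0, count = 0;

    while (len - off >= sizeof(struct demo_header)) {
        struct demo_header h;

        memcpy(&h, log + off, sizeof h);   /* avoids unaligned access */
        off += sizeof h;
        if (h.log_size > len - off)
            break;                         /* incomplete trailing record */
        off += h.log_size;                 /* skip the variable data */
        count++;
    }
    return count;
}
```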

B.20.15   Notify Process of Availability of System Log Data

B.20.15.1  Purpose of Notifications

Existing practice demonstrated the need for a notification interface for event logging.  Notification eliminates the need for an application to poll the log for entries of interest.  Also, since notification is available, there is no need for a blocking form of posix_log_read().

There was discussion of two fundamentally different reasons for posting notifications:

  1. real-time notification requiring a real-time response (e.g., “en