SYSTEM AND METHOD FOR COMPUTER
OPERATING SYSTEM PROTECTION
BACKGROUND OF THE INVENTION
The invention relates to a system and method for protecting a computer operating system from unexpected errors, and more particularly to a system and method for improving application stability under the Microsoft WINDOWS operating system.
Multitasking, graphics-based operating systems such as Microsoft WINDOWS 95 demand a high degree of expertise from an application programmer. The difficulties inherent in writing synchronized program code in an event-driven, multitasking environment, coupled with a vast and changing system application program interface ("API") consisting of thousands of functions, inevitably result in the production of software programs that contain errors, or "bugs," at several points. Even if an application program is tested relatively thoroughly, some portions of the program code may not be sufficiently exercised to locate the errors. And even if the erroneous portion is executed during testing, it may cause seemingly benign errors that pass undetected.
User input to software, through the keyboard, mouse, etc., is frequently unpredictable. Because of this, an application may attempt to process a combination of parameters that was not anticipated by the programmer. In this case, too, the program may respond in a benign manner, or in some circumstances may cause certain regions of memory to be inadvertently altered, or "corrupted." Those memory regions might "belong" to the program being executed, or might belong to the operating system or another loaded program. Similarly, the corrupted regions might include important data, or they might be unallocated storage. It generally is not possible to determine, in advance, what regions of memory a defective program might attempt to access.
In some circumstances, a programming error may trigger a CPU exception if the program attempts to perform an illegal operation. A CPU exception is the central processing unit's response to an error condition, whether expected or unexpected. For example, an attempt to perform an undefined mathematical operation (such as dividing by zero), an attempt to access a memory location that does not exist, or an attempt to execute code that does not satisfy the CPU's syntax requirements, will typically result in a CPU exception. However, not all CPU exceptions result in a "crash" of the system. A CPU exception will cause a software interrupt. That is, when a CPU exception is encountered, processing immediately stops and is transferred to another program location.
That other program location can contain a segment of program code designed to take whatever action is intended by an operating system programmer. For example, an error message can be presented to the operator. Alternatively, if the CPU exception was expected, then other processing can be performed. Such an exception-handling scheme is used in Microsoft WINDOWS and other operating systems to handle "virtual memory," in which disk storage is used to virtually increase the amount of system memory. Some of the contents of system memory are "swapped out" to disk and removed from memory. Upon a later attempt to access those contents, a CPU exception will occur because the contents sought do not exist within system memory. The operating system will then handle that expected CPU exception condition, bring the contents back into system memory, and allow the operation to proceed.
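The handling of an expected exception described above can be illustrated with a toy model. The following Python sketch is an analogy only and is not the WINDOWS implementation; the names ToyPager and PageFault, and the use of dictionaries to stand in for system memory and disk storage, are hypothetical.

```python
class PageFault(Exception):
    """Stands in for the CPU exception raised on access to a non-resident page."""

    def __init__(self, page):
        super().__init__(f"page {page} not resident")
        self.page = page


class ToyPager:
    """Hypothetical model: 'memory' holds resident pages, 'disk' holds swapped-out ones."""

    def __init__(self):
        self.memory = {}
        self.disk = {}
        self.faults_handled = 0

    def swap_out(self, page):
        # Move the page contents to disk and remove them from memory.
        self.disk[page] = self.memory.pop(page)

    def _raw_read(self, page):
        # The "CPU" access: fails if the page is not in system memory.
        if page not in self.memory:
            raise PageFault(page)
        return self.memory[page]

    def read(self, page):
        # The "operating system" path: an expected fault is handled by
        # bringing the contents back, after which the access is retried.
        try:
            return self._raw_read(page)
        except PageFault as fault:
            if fault.page not in self.disk:
                raise  # unexpected fault: there is nothing to swap in
            self.memory[fault.page] = self.disk.pop(fault.page)
            self.faults_handled += 1
            return self._raw_read(page)
```

In this model, a read of a swapped-out page succeeds transparently after one handled fault, whereas a read of a page that exists neither in memory nor on disk corresponds to an unexpected exception and is passed on to the caller.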
Most complex operating systems, including Microsoft WINDOWS 95, use CPU exception handling techniques in performing a wide variety of operations. Even so, in many cases, a CPU exception will reflect an error or malfunction. In such cases, the operating system will typically not be able to correct the malfunction, and can only present an error message (typically a cryptic one, useless to all but the most experienced and knowledgeable programmers) to the computer operator.
Depending on the nature of the malfunction, and the action, if any, that the operating system takes in an attempt to block or remedy the malfunction, the offending program can perform in one of numerous ways. The system may stop executing and appear to be deadlocked. The application may continue executing despite the possibility that important data has been corrupted. The application may be shut down by the operating system, or may so adversely affect the operating system itself that the computer must be restarted with an accompanying loss of data.
One goal of operating system design is to minimize the possibility of data loss, and the general trend for the most advanced operating systems, such as Microsoft WINDOWS NT, has been to shield (as far as possible) the memory regions containing the operating system's code and data from the reach of an application program. In other words, an application program can alter itself and its own data, but would be entirely unable to affect any other portion of the system, including other application programs and the operating system itself.
However, a rigorous implementation of this architecture may not be feasible in a mass-market operating system which is designed to operate on lower-cost systems, which typically have slower CPUs and tighter system memory constraints. Therefore, the Microsoft WINDOWS 95 operating system, which substantially retains the memory architecture of earlier versions of WINDOWS, remains highly susceptible to many types of program errors. In fact, it is relatively easy to write code that will crash the operating system.
One program of this kind is discussed in Schulman, Unauthorized Windows 95 (IDG Books 1994), and is available from ftp://ftp.ora.com/pub/examples/windows/win95.update/unauthw.html. This program, RANDRW, purports to measure the susceptibility of various operating systems to serious program errors. According to its author, RANDRW makes random memory accesses across the memory range of the system. An access is deemed a "hit" if it is allowed to proceed without being blocked by the operating system. In the WINDOWS 95 environment, Schulman reported a hit rate of approximately 1.5%, indicating that improper accesses were being allowed to occur. It should be noted that the 4 gigabyte address space in which WINDOWS 95 runs is generally about 90% unused and uncommitted, so that the 1.5% hit rate within the 4 gigabyte range translates into a much larger percentage of wrongful memory access and data corruption.
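The scaling noted above can be made concrete with a short calculation. The following Python sketch assumes, for illustration only, that the random accesses are uniformly distributed over the address space and that only committed addresses can produce a hit; the function name effective_hit_rate is hypothetical.

```python
from fractions import Fraction


def effective_hit_rate(raw_hit_rate, unused_fraction):
    """Hit rate relative to the committed portion of the address space.

    Assumes uniformly distributed accesses and that only committed
    addresses can produce a hit (both are simplifying assumptions).
    """
    committed_fraction = 1 - unused_fraction
    return raw_hit_rate / committed_fraction


# 1.5% raw hit rate, with about 90% of the address space unused.
rate = effective_hit_rate(Fraction(15, 1000), Fraction(90, 100))
print(float(rate))  # 0.15
```

Under these assumptions, roughly 15% of accesses that land in committed memory would be allowed to proceed.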
A breakdown of RANDRW memory accesses by address has shown that almost all of the core WINDOWS system components are susceptible to being corrupted in this way. The ease with which a 32-bit application program can affect critical system memory is especially alarming because the entire address range of the processor, including the address ranges occupied by critical system components, is within the accidental reach of the program. Older 16-bit programs are able to reach a narrower extent of system resources, but are still able to cause serious damage.
Unfortunately, it is practically impossible to predict the manner of a malfunction. When one occurs, it is correspondingly difficult to remedy the malfunction so that the program that caused it is able to proceed. If there is an isolated stray access, it may be possible to block the access with no appreciable effect on the program. More likely, an application program was attempting to perform a certain operation when it went awry, and its failure to accomplish the operation will affect further operations. Hence, one fault results in another, and the entire course of the program is altered. In certain circumstances, the CPU context of the program may become damaged. For example, an unbalanced stack may cause the stack pointer to be reset, thereby making continued execution of the program impossible and a haphazard restoration of the CPU context unavailing. A side effect of this latter kind of error is that fault handlers built into the program (even those outside of the application program but executing at the same CPU privilege level as the program) will probably also be unable to execute or will themselves malfunction in the attempt.
In addition, one further type of application failure can be identified, in which the application appears to be deadlocked because it is improperly executing an infinite loop. A failure of this kind will not result in a CPU fault and may not cause any data to be corrupted. However, because the program is essentially deadlocked, it might not accept any further input, necessitating a forced shutdown with data being lost.
One prior attempt to address these issues is embodied in the software utility called FIRST AID, various versions of which have been available from Cybermedia, and similarly in subsequent products such as NORTON CRASH GUARD from Symantec and PC MEDIC from McAfee Associates. In FIRST AID, an assumption is made that the architecture of almost all WINDOWS programs is founded on a core piece of program code called the "message loop." In general, after an application program is initialized by creating one or more windows to be displayed on the desktop, it enters the message loop, from which it exits only when the program is terminated. The message loop itself consists of a series of prescribed WINDOWS API function calls that pick up user input and other messages from a system-managed queue, associate them with one of the application's windows, and dispatch them to the message handling procedure of the appropriate window for processing.
The majority of an application's program code is contained in its window procedures, and is caused to be executed either, in the first case, indirectly when a message is dispatched from the message loop, or in other cases, by the WINDOWS operating system bypassing the message loop and calling the window procedure directly. Although there are certain other means by which an application's program code can be executed, these are in a minority. Therefore, when a program malfunctions, it is likely to be executing code contained in its window procedures in response to some message.
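The message loop structure described above can be sketched in simplified form. The following Python model is illustrative only: it does not call the WINDOWS API, and the names run_message_loop and QUIT, as well as the tuple layout of a message, are hypothetical.

```python
from collections import deque

QUIT = "QUIT"  # hypothetical stand-in for the message that ends the loop


def run_message_loop(queue, window_procedures):
    """Pick up messages from the queue and dispatch them until QUIT arrives.

    queue: deque of (window_id, message, payload) tuples.
    window_procedures: dict mapping window_id to a callable handler.
    Returns the number of messages dispatched.

    Unlike a real message loop, which waits when the queue is empty,
    this model simply returns once no messages remain.
    """
    dispatched = 0
    while queue:
        window_id, message, payload = queue.popleft()
        if message == QUIT:
            break
        # Associate the message with a window and call its procedure.
        window_procedures[window_id](message, payload)
        dispatched += 1
    return dispatched
```

Because nearly all of the application's work happens inside the procedures invoked from this loop, a fault is most likely to arise while one of those procedures is handling a particular message.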
FIRST AID makes the assumption that the specific message input that caused the error may not be repeatable, and that it may not be necessary to complete processing of the specific message input. Instead, FIRST AID attempts to enter a new message loop at the point that otherwise the program would have been terminated. For this purpose it installs a driver that gains control whenever a CPU fault occurs. Executing within the context of the faulting application, the driver alerts the user to the error condition, and allows him to decide to terminate the application, as would happen normally, or to reactivate it. Reactivating the application consists of a series of steps intended to ensure that certain abnormal conditions are reset, such as enabling input to the application's visible windows. The driver then enters its own message loop, which is probably fundamentally similar to that contained in the faulting program. Ideally, this will restore the appearance of activity to the application, and the user will be able to access the application's menus and controls at least long enough and well enough to save the application's data to disk.
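The general idea of abandoning the faulting message and entering a replacement loop can be sketched as follows. This Python model is an analogy under stated assumptions and is not the FIRST AID implementation: a Python exception stands in for a CPU fault, and the names dispatch_all, run_with_recovery, and should_reactivate are hypothetical.

```python
from collections import deque


def dispatch_all(queue, window_procedure):
    """Ordinary loop: any exception from the handler propagates (a 'crash')."""
    while queue:
        window_procedure(queue.popleft())


def run_with_recovery(queue, window_procedure, should_reactivate):
    """Hypothetical recovery wrapper around the ordinary loop.

    When the handler faults, the offending message is abandoned and,
    if should_reactivate(fault) returns True, a fresh loop is entered
    on the remaining messages. Returns the list of intercepted faults.
    """
    intercepted = []
    while queue:
        try:
            dispatch_all(queue, window_procedure)
        except Exception as fault:  # stands in for a CPU fault
            intercepted.append(fault)
            if not should_reactivate(fault):
                raise  # terminate, as would happen normally
            # Otherwise fall through and enter a new loop.
    return intercepted
```

The model captures only the control flow. It assumes that the handler's state is still usable after the fault, which is precisely the assumption that may fail in practice.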
In less than ideal conditions, however, the method of FIRST AID and subsequent products may be limited to a certain class of application errors, may crash the program by offering to recover it from an error that would not have turned out to be fatal, or may cause the operating system itself to become deadlocked, requiring a system restart. Furthermore, by assuming that the error occurred while the program was executing its own code, FIRST AID ignores the possibility that the error may have occurred within the WINDOWS graphical user interface ("GUI") subsystem. Consequently, by creating a GUI interface (such as a "dialog box") by which the user can choose to recover from the error, and by issuing WINDOWS API calls from within the new message loop, the WINDOWS subsystem may be reentered and further corrupted. The Microsoft documentation for the WINDOWS API function "InterruptRegister" notes in this regard that a fault callback procedure may "execute a nonlocal goto to a known position in the application .... This type of interrupt handling can be hazardous; the system may be in an unstable state and another fault may occur. Applications that handle interrupts in this way must verify that the fault was a result of the application's code." However, such verification is not made.
In addition, FIRST AID and the other known products utilize WINDOWS Kernel services, such as those contained in the "ToolHelp" library, in order to trap the error conditions, and therefore the error handling and recovery code in these products executes at the same CPU privilege level and in the same CPU context as the faulting program. However, as discussed above, depending on the nature of the error (e.g. if the program's stack pointer has been corrupted), it may be impossible or inadvisable to perform any significant operation from within the fault handling procedure, including attempting to reactivate the program by reentering its message loop. Alternatively, stack fault errors may cause the fault handling code to be entered using a separate stack from the one used by the faulting program, in which case FIRST AID will not attempt to return to the original stack prior to resuming the program.
Moreover, certain faults do not cause the fault handling procedure to be executed at all, for example if the original fault ultimately results in another fault occurring within the WINDOWS Kernel as it is attempting to call the fault handling procedure. Finally, neither FIRST AID nor other crash protection implementations provide any safeguards that prevent a malfunctioning program from corrupting the WINDOWS Kernel or other system components.
Another known protection method, embodied in Symantec's NORTON CRASH GUARD product for WINDOWS 95, provides crash recovery as generally described above, and also allows deadlocked applications executing in infinite loops to be reactivated. NORTON CRASH GUARD accomplishes this by providing in its interface an option to reactivate a program that NORTON CRASH GUARD has adjudged to be deadlocked. However, in order to activate the NORTON CRASH GUARD interface and hence reactivate the deadlocked program, the WINDOWS GUI subsystem must be able to perform a focus switch away from the deadlocked program to the NORTON CRASH GUARD interface. Depending on the nature of the deadlock, this may not be possible. For example, it may not be possible to invoke the NORTON CRASH GUARD interface when the deadlocked program causes the system itself to appear deadlocked because of holding certain resources that the system must acquire in order to activate another program.
Consequently, in view of the known limitations of prior crash protection utilities used in the MICROSOFT WINDOWS environment, it would be desirable to have a utility that is not so limited. Specifically, such a protection utility would allow applications to safely recover from most unanticipated CPU exceptions, at least long enough to save any data. Such a protection utility would also safeguard the operating system from being corrupted by an errant application program, thereby enhancing overall system stability.
SUMMARY OF THE INVENTION
The CPU exception handling system and method employed by the invention handles many different types of application-level faults. The invention allows the CPU context of a malfunctioning program to be recovered and repaired outside of the context of the program, thereby permitting recovery from relatively serious errors. Moreover, deadlocked applications can be recovered, even if they cause the system itself to appear deadlocked.
The invention includes a scheme for protecting the operating system and other running software from corruption by a malfunctioning program, thereby substantially confining the effects of an error to the context of the program that caused it.
To accomplish these purposes, the invention adapts the WINDOWS Kernel fault notification dialog to provide that fault notifications and recovery take place within a safe context outside the context of the malfunctioning program. The invention also adapts the WINDOWS Kernel process-termination dialog to provide that deadlocked applications may be recovered even when the system is otherwise deadlocked. Finally, CPU page-level protection is employed to write-protect major portions of the operating system and prevent them from being corrupted.
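Page-level write protection in general can be illustrated with a toy model. The following Python sketch is not the mechanism of the invention; it shows only the underlying idea that each page carries flag bits and that a write to a page whose writable bit is clear results in a fault. The bit positions follow the x86 page table entry layout (bit 0 present, bit 1 read/write); the names ToyAddressSpace and WriteFault are hypothetical.

```python
PAGE_SIZE = 4096
PTE_PRESENT = 0x1   # bit 0 of an x86 page table entry
PTE_WRITABLE = 0x2  # bit 1: set means writes are allowed


class WriteFault(Exception):
    """Stands in for the page fault raised on a disallowed write."""


class ToyAddressSpace:
    def __init__(self):
        self.page_table = {}  # page number -> flag bits
        self.bytes = {}       # address -> value

    def map_page(self, page, writable=True):
        flags = PTE_PRESENT | (PTE_WRITABLE if writable else 0)
        self.page_table[page] = flags

    def write_protect(self, first_page, last_page):
        # Clear the writable bit on every mapped page in the range.
        for page in range(first_page, last_page + 1):
            if page in self.page_table:
                self.page_table[page] &= ~PTE_WRITABLE

    def write(self, address, value):
        flags = self.page_table.get(address // PAGE_SIZE, 0)
        if not flags & PTE_PRESENT or not flags & PTE_WRITABLE:
            raise WriteFault(hex(address))
        self.bytes[address] = value
```

In this model, a stray write into a protected range is stopped before any data is altered, while writes to unprotected pages proceed normally.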
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating the relationship among functional hardware and software components in a typical computer system operating under WINDOWS 95;
FIG. 2 is a diagram illustrating various portions of system memory as utilized under WINDOWS 95;
FIG. 3 is a flowchart illustrating the steps performed in a CPU exception or infinite loop recovery process performed according to the invention;
FIG. 4 is a flowchart illustrating the procedure followed in protecting certain WINDOWS Kernel segments by the method of the invention;
FIG. 5 is a flowchart illustrating the steps performed in protecting certain portions of the WINDOWS operating system from inadvertent alteration;
FIG. 6 is a flowchart illustrating the process performed in the protection of DOS memory from inadvertent alteration by a WINDOWS program; and
FIG. 7 is a flowchart illustrating the steps performed in protecting the disk cache from inadvertent alteration by an errant program.
DETAILED DESCRIPTION OF THE
A comprehensive scheme of CPU exception handling and operating system protection implemented according to the invention includes two primary components: (1) trapping and recovering from unexpected CPU exceptions and infinite loops; and (2) protecting portions of the WINDOWS operating system from inadvertent and erroneous attempts to write data. These aspects and components of the invention will be discussed in detail below.
The architecture of a typical computer system running Microsoft WINDOWS 95 is illustrated functionally in FIG. 1. A central processing unit ("CPU") 110 is coupled to system memory 112 and at least one I/O adapter 114. The I/O adapters present in a typically configured computer system can include interfaces to a keyboard, a pointing device such as a mouse, a video display, a printer, a modem, etc. The system is also furnished with operating system software 116, which oversees transactions between the CPU 110 and the system memory 112 or the I/O adapters 114.
Modern operating systems take advantage of this arrangement in numerous ways. For example, the capacity of system memory 112 can be effectively enlarged by a virtual memory method, in which disk storage (attached to the CPU by way of an I/O adapter 114) is used to supplement the system memory 112. When the CPU 110 attempts to access a portion of system memory 112 that is actually stored on disk, a CPU exception, or fault, will result. The operating system 116 anticipates this exception and acts accordingly to bring the requested contents into system memory 112 for access by the CPU 110.
Many CPU exceptions are anticipated and handled by modern operating systems in this manner. However, unanticipated CPU exceptions still occur, caused by careless programming, insufficient testing, or any number of other factors. As discussed above, these errors can cause an application program or the entire operating system 116 to become deadlocked.
The system memory 112 is divided into a number of distinct regions, as shown in FIG. 2. A DOS region 210 occupies the portion of the system memory 112 between its lower end (given an address identifier of zero) and somewhere within a first megabyte 212. The remainder of the first megabyte 212 is devoted to a first portion 214 of the 16-bit Global Heap.
Immediately following the first megabyte is the high memory area ("HMA") 216. An empty region 218 following the HMA 216 up through the four-megabyte boundary 220 typically is not mapped to any system memory.
From the four-megabyte boundary 220 to a two-gigabyte boundary 222 is a private memory region 224. The private memory region 224 is occupied by 32-bit WINDOWS applications, as will be discussed in further detail below. The following, third, gigabyte is occupied by a second portion 226 of the 16-bit Global Heap, which is shared among the programs running on the system.
The remaining system memory, the fourth gigabyte of the four-gigabyte range, is unshared system memory. It includes a system region 228 for virtual device drivers and the system heap, followed by a cache region 230.
The final eight megabytes of addressable memory include the CPU page tables 232, which contain information (such as the write-protection information that will be discussed in further detail below) on the pages of physical memory mapped within the four-gigabyte range of FIG. 2.
It should be noted that while the diagram of FIG. 2 reflects a four-gigabyte range of memory, usually only a small portion of that range is actually occupied by physical system memory. However, pages of physical system memory can be mapped to nearly any portion of the range, and need not be contiguous.
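The layout of FIG. 2 as described above can be summarized as a lookup table. The following Python sketch is illustrative only; the names REGIONS and region_of are hypothetical. Because the description does not give the boundary between the DOS region 210 and the first Global Heap portion 214, the extent of the HMA 216, or the boundary between the system region 228 and the cache region 230, those areas are lumped together in the table.

```python
from bisect import bisect_right

MB = 1 << 20
GB = 1 << 30

# (start address, label) pairs in ascending order; each region runs up to
# the next start. Areas whose internal boundaries are not stated in the
# description are lumped into a single entry.
REGIONS = [
    (0,               "DOS region 210 / Global Heap portion 214"),
    (1 * MB,          "HMA 216 / unmapped region 218"),
    (4 * MB,          "private region 224 (32-bit applications)"),
    (2 * GB,          "Global Heap portion 226 (shared)"),
    (3 * GB,          "system region 228 / cache region 230"),
    (4 * GB - 8 * MB, "final 8 MB (includes CPU page tables 232)"),
]
STARTS = [start for start, _ in REGIONS]


def region_of(address):
    """Return the label of the region containing a 32-bit address."""
    if not 0 <= address < 4 * GB:
        raise ValueError("address outside the four-gigabyte range")
    return REGIONS[bisect_right(STARTS, address) - 1][1]
```

For example, region_of(4 * MB) returns the label of the private region 224, and region_of(2 * GB) returns the label of the shared Global Heap portion 226.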