CA2264599A1 - Fault resilient/fault tolerant computing - Google Patents

Fault resilient/fault tolerant computing

Info

Publication number
CA2264599A1
Authority
CA
Canada
Prior art keywords
request
controller
error
controllers
freeze
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002264599A
Other languages
French (fr)
Inventor
Thomas D. Bissett
Martin J. Fitzgerald V
Paul A. Leveille
James D. Mccollum
Erik Muench
Glenn A. Tremblay
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Marathon Technologies Corp
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of CA2264599A1
Current legal status: Abandoned


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, with the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0745 Error or fault processing taking place in an input/output transactions management context
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/1641 Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
    • G06F11/1645 Error detection where the comparison is not performed by the redundant processing components and the comparison itself uses redundant hardware
    • G06F11/165 Error detection by comparing the output of redundant processing systems with continued operation after detection of the error
    • G06F11/1658 Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
    • G06F11/1675 Temporal synchronisation or re-synchronisation of redundant processing components
    • G06F11/1683 Temporal synchronisation or re-synchronisation of redundant processing components at instruction level
    • G06F11/1691 Temporal synchronisation or re-synchronisation of redundant processing components using a quantum
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/845 Systems in which the redundancy can be transformed in increased performance

Abstract

Data transfer to computing elements (14a, 14b) is synchronized in a computer system (10) that includes the computing elements and controllers (12) that provide data from data sources to the computing elements. A request for data from a computing element (14a, 14b) is intercepted and transmitted to the controllers (12). At least a first controller responds by transmitting requested data to the computing element and by indicating how a second controller will respond to the intercepted request.

Description

FAULT RESILIENT/FAULT TOLERANT COMPUTING

Background of the Invention

The invention relates to fault resilient and fault tolerant computing.

Fault resilient computer systems can continue to function in the presence of hardware failures. These systems operate in either an availability mode or an integrity mode, but not both. A system is "available" when a hardware failure does not cause unacceptable delays in user access. Accordingly, a system operating in an availability mode is configured to remain online, if possible, when faced with a hardware error. A system has data integrity when a hardware failure causes no data loss or corruption. Accordingly, a system operating in an integrity mode is configured to avoid data loss or corruption, even if the system must go offline to do so.

Fault tolerant systems stress both availability and integrity. A fault tolerant system remains available and retains data integrity when faced with a single hardware failure, and, under some circumstances, when faced with multiple hardware failures.

Disaster tolerant systems go one step beyond fault tolerant systems and require that loss of a computing site due to a natural or man-made disaster will not interrupt system availability or corrupt or lose data.

Typically, fault resilient/fault tolerant systems include several processors that may function as computing elements or controllers, or may serve other roles. In many instances, it is important to synchronize operation of the processors or the transmission of data between the processors.

Summary of the Invention

In one aspect, generally, the invention features synchronizing data transfer to a computing element in a computer system including the computing element and controllers that provide data from data sources to the computing element. A request for data made by the computing element is intercepted and transmitted to the controllers. Controllers respond to the request, and at least one controller responds by transmitting requested data to the computing element and by indicating how another controller will respond to the intercepted request.

Embodiments of the invention may include one or more of the following features. A controller may respond to the intercepted request by indicating that the controller has no data corresponding to the intercepted request and by indicating that another controller will respond to the intercepted request by transmitting data to the computing element. Each response to the intercepted request by a controller may include an indication as to how each other controller will respond to the intercepted request.

The computing element may compare the responses to the intercepted request for consistency. When each response includes an indication as to how each other controller will respond to the intercepted request, the comparison may include comparing the indications for consistency. When responses of two or more controllers include requested data, the comparison may include comparing the data for consistency. The computing element may notify the controllers of the outcome of the comparison and that responses have been received from all of the controllers.
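The cross-indication scheme above can be illustrated with a short sketch. The following C fragment is not from the patent; the names (Response, responses_consistent, NUM_CONTROLLERS) and the two-controller setup are invented for the example. Each response carries the controller's own payload status plus its prediction of every peer's status, and the computing element accepts the set only when predictions and actual responses agree and any duplicated data matches.

    #include <stdio.h>
    #include <string.h>

    #define NUM_CONTROLLERS 2

    typedef enum { RESP_DATA, RESP_NO_DATA } ResponseKind;

    /* One controller's reply: its own status plus its view of every peer. */
    typedef struct {
        ResponseKind kind;                        /* how this controller answered */
        ResponseKind predicted[NUM_CONTROLLERS];  /* how it says each one answers */
        const char  *data;                        /* payload, valid for RESP_DATA */
    } Response;

    /* The computing element accepts the set only if every controller's
     * prediction of controller j matches what j actually reported, and
     * any duplicated data agrees. */
    static int responses_consistent(const Response r[NUM_CONTROLLERS])
    {
        for (int i = 0; i < NUM_CONTROLLERS; i++)
            for (int j = 0; j < NUM_CONTROLLERS; j++)
                if (r[i].predicted[j] != r[j].kind)
                    return 0;                     /* indications disagree */

        const char *seen = NULL;
        for (int i = 0; i < NUM_CONTROLLERS; i++) {
            if (r[i].kind != RESP_DATA)
                continue;
            if (seen != NULL && strcmp(seen, r[i].data) != 0)
                return 0;                         /* two data responses differ */
            seen = r[i].data;
        }
        return 1;
    }

    int main(void)
    {
        /* Controller 0 supplies the data; both agree controller 1 has none. */
        Response r[NUM_CONTROLLERS] = {
            { RESP_DATA,    { RESP_DATA, RESP_NO_DATA }, "sector 42" },
            { RESP_NO_DATA, { RESP_DATA, RESP_NO_DATA }, NULL        },
        };
        printf("consistent: %d\n", responses_consistent(r));
        return 0;
    }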
A controller may be disabled when the responses are not consistent. In addition, an error condition may be generated if the computing element does not receive responses from all of the controllers within a predetermined time period.

A data source may be associated with a controller, and the controller may obtain the requested data from the data source in response to the intercepted request.

A controller may maintain a record of a status of another controller, and may use the record when indicating how the other controller will respond to the intercepted request. When a data source is associated with the other controller, the record may include the status of the data source. Each controller may maintain records of statuses of all other controllers and may use the records to indicate how the other controllers will respond to the intercepted request. When each controller is associated with a data source, each controller may maintain records of statuses of data sources associated with all other controllers.

When a status of a data source associated with a controller changes, the controller may transmit to the computing element an instruction to discard responses from other controllers to the intercepted request. The computing element may respond to the instruction by discarding responses from other controllers to the intercepted request and by transmitting to the controllers a notification that the responses have been discarded. A controller may respond to the notification by updating a record of the status of the data source. After updating the record, the controller may retransmit the requested data to the computing element and indicate how the other controller will respond to the intercepted request.

When a data source is associated with each controller, each controller may respond to the intercepted request by determining whether an associated data source is expected to process the request and, when the associated data source is expected to process the request, transmitting the request to the associated data source, receiving results of the request from the associated data source, and forwarding the results of the request to the computing element. When the associated data source is not expected to process the request, the controller may respond by informing the computing element that no data will be provided in response to the request.

In another aspect, generally, the invention features maintaining synchronization between computing elements processing identical instruction streams in a computer system including the computing elements and controllers that provide data from data sources to the computing elements, with the controllers operating asynchronously to the computing elements. Computing elements processing identical instruction streams each stop processing of the instruction stream at a common point in the instruction stream. Each computing element then generates a freeze request message and transmits the freeze request message to the controllers. A controller receives a freeze request message from a computing element, waits for freeze request messages from other computing elements, and, upon receiving a freeze request message from each computing element processing an identical instruction stream, generates a freeze response message and transmits the freeze response message to the computing elements. Each computing element, upon receiving a freeze response message from a controller, waits for freeze response messages from other controllers to which a freeze request message was transmitted, and, upon receiving a freeze response message from each controller, generates a freeze release message, transmits the freeze release message to the controllers, and resumes processing of the instruction stream.
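A minimal sketch of this freeze handshake follows, with shared counters standing in for the real interconnect. The function names and the polling structure are invented, and a real implementation would be event-driven rather than this sequential simulation.

    #include <stdio.h>

    #define NUM_CES  2
    #define NUM_CTRL 2

    static int freeze_requests[NUM_CTRL]; /* requests seen by each controller */
    static int freeze_responses[NUM_CES]; /* responses seen by each CE        */

    /* Each CE stops at the common point in the instruction stream and
     * sends a freeze request to every controller. */
    static void ce_send_freeze_request(int ce)
    {
        for (int c = 0; c < NUM_CTRL; c++)
            freeze_requests[c]++;
        printf("CE %d: frozen at common point, request sent\n", ce);
    }

    /* A controller replies only once a request has arrived from every CE
     * processing the identical instruction stream. */
    static void controller_poll(int c)
    {
        if (freeze_requests[c] == NUM_CES) {
            for (int ce = 0; ce < NUM_CES; ce++)
                freeze_responses[ce]++;
            freeze_requests[c] = 0;
            printf("controller %d: all requests in, response sent\n", c);
        }
    }

    /* A CE resumes only after a response has arrived from every controller,
     * sending a freeze release as it resumes. */
    static void ce_poll(int ce)
    {
        if (freeze_responses[ce] == NUM_CTRL) {
            freeze_responses[ce] = 0;
            printf("CE %d: all responses in, release sent, resuming\n", ce);
        }
    }

    int main(void)
    {
        for (int ce = 0; ce < NUM_CES; ce++) ce_send_freeze_request(ce);
        for (int c = 0; c < NUM_CTRL; c++)  controller_poll(c);
        for (int ce = 0; ce < NUM_CES; ce++) ce_poll(ce);
        return 0;
    }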
Embodiments of the invention may include one or more of the following features. The common point in the instruction stream may correspond to an I/O operation, the occurrence of a predetermined number of instructions without an I/O operation, or both.

A controller may include a time update in the freeze response message, and a computing element, upon receiving a freeze response message from each controller to which a freeze request message was transmitted, may update a system time using the time update from a freeze response message. The computing element may use the time update from a freeze response message generated by a particular controller.

Upon receiving a freeze response message from each controller to which a freeze request message was transmitted, a computing element may process data received from a controller prior to receipt of freeze response messages from the controllers.

In another aspect, generally, the invention features handling faults in a computer system including error reporting elements and error processing elements. An error reporting element detects an error condition and transmits information about the error condition as an error message to error processing elements connected to the error reporting element. At least one error processing element retransmits the error message to other error processing elements connected to the error processing element.

In another aspect, generally, the invention features handling faults in a computer system including error reporting elements and error processing elements. Error reporting elements detect an error condition and transmit information about the error condition as error messages to error processing elements connected to the error reporting elements. At least one error processing element combines information from related error messages from multiple error reporting elements and uses the combined information in identifying a source of the error condition.

The error processing element may use a state table to combine information from related error messages. The error processing element may represent an error message using an error identifier that identifies a particular error, an error target that identifies a subcomponent that caused the error represented by the error message, and a reporting source that identifies an error reporting element that generated the error message and a path over which the error message was received. The error processing element may determine whether error messages are related by comparing a received error message against states representing previously received error messages.
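A sketch of this error message representation in C follows. The field names are invented stand-ins for the error identifier, error target, and reporting source described above, and the relatedness test is reduced to matching identifier and target.

    #include <stdio.h>

    /* One error message, represented as in the description above: what
     * happened, which subcomponent caused it, and who reported it over
     * which path. Field names are illustrative. */
    typedef struct {
        int error_id;   /* identifies the particular error          */
        int target;     /* subcomponent that caused the error       */
        int reporter;   /* error reporting element that sent it     */
        int path;       /* path over which the message was received */
    } ErrorMessage;

    /* Two messages are treated as related when they describe the same
     * error against the same target, even if reported by different
     * elements over different paths. */
    static int related(const ErrorMessage *a, const ErrorMessage *b)
    {
        return a->error_id == b->error_id && a->target == b->target;
    }

    int main(void)
    {
        ErrorMessage seen     = { 7, 3, 0, 0 };  /* previously received state */
        ErrorMessage incoming = { 7, 3, 1, 1 };  /* same fault, new witness   */

        if (related(&seen, &incoming))
            printf("combine reports: target %d confirmed by reporters %d and %d\n",
                   seen.target, seen.reporter, incoming.reporter);
        return 0;
    }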
Other features and advantages will become apparent from the following description, including the drawings, and from the claims.

Brief Description of the Drawings

Fig. 1 is a block diagram of a partially fault resilient system.

Fig. 2 is a block diagram of system software of the system of Fig. 1.

Fig. 3 is a flowchart of a procedure used by an IOP Monitor of the system software of Fig. 2.

Fig. 4 is a block diagram of an IPI module of the system of Fig. 1.

Fig. 5 is a state transition table for the system of Fig. 1.

Fig. 6 is a block diagram of a fault resilient system.

Fig. 7 is a block diagram of a distributed fault resilient system.

Fig. 8 is a block diagram of a fault tolerant system.

Fig. 9 is a flowchart of a fault diagnosis procedure used by IOPs of the system of Fig. 8.

Fig. 10 is a block diagram of a disaster tolerant system.

Fig. 11 is a block diagram of software components of a computer system.

Figs. 12A and 12B are flow diagrams illustrating information transferred between CEs and IOPs.

Fig. 13 is a flowchart of a flush procedure implemented by the CE Transport.

Figs. 14A and 14B are block diagrams illustrating physical and logical system configurations.

Fig. 15 is a flowchart of a flush procedure.

Figs. 16A and 16B are tables of microcode state transitions.

Fig. 17 is a block diagram of a system.

Fig. 18 is a flowchart of a procedure for error processing.

Fig. 19A is a syndrome source table.

Fig. 19B is a state transition table corresponding to the syndrome source table of Fig. 19A.

Fig. 20 is a callout hierarchy diagram.

Figs. 21A, 21B and 22 are tables listing callout elements.

Fig. 23 is a flowchart of a procedure for identifying a faulty component.

Description of the Preferred Embodiments

Fig. 1 illustrates a fault resilient system 10 that includes an I/O processor ("IOP") 12 and two computing elements ("CEs") 14a, 14b (collectively referred to as CEs 14). System 10 includes only a single IOP 12 and therefore cannot recover from a failure in IOP 12. As such, system 10 is not entirely fault resilient.
For example, if CEs 14 were1015202530CA 02264599 1999-03-05W0 98/ 12657 PCT/U S97/ 16218-9-running an operating system that did not support CD-ROMdrives, and IOP 12 were running one that did, CEs 14could access the CD-ROM drive by issuing I/O requestsidentical to those used to, for example, access a hardIOP 12 would then handle the translation of theI/O request to one suitable for accessing the CD—ROMdrive.drive.Fig. 2 provides an overview of specialized systemsoftware 40 used by system 10 to control the booting andsynchronization of CEs 14, to disable local time in CEs14, to redirect all I/O requests from CEs 14 to IOP 12for execution, and to return the results of the I/Orequests, if any, from IOP 12 to CEs 14. _System software 40 includes two sets of IPI BIOS42 that are ROM-based and are each located in the IPImodule 18 of a CE 14. IPI BIOS 42 are used in bootup andsynchronization activities. When a CE 14 is booted, IPIBIOS 42 replaces the I/O interrupt addresses in thesystem BIOS interrupt table with addresses that arecontrolled by CE Drivers 44. The interrupt addressesthat are replaced include those corresponding to videoservices, fixed disk services, serial communicationsservices, keyboard services, and time of day services.CE Drivers 44 are stored on CE boot disk 28 andare run by CEs 14. CE Drivers 44 intercept I/O requeststo the system BIOS and redirect them through IPI modules18 to IOP 12 for execution.to interrupt requests from IPI modules 18, disable theCE Drivers 44 also respondsystem clock, and, based on information supplied by IOPMonitor 48, control the time of day of CEs 14.An IOP Driver 46 that is located on IOP boot disk26 and is run by IOP 12 handles I/O requests from CEs 14by redirecting them to an IOP Monitor 48 for processing.Thereafter, IOP Driver 46 transmits the results of theW0 98/ 12657requests from IOP Monitor 48 to CEs 14.CA 02264599 1999-03-05PCT/US97ll62l8-10-IOP Driver 46communicates with CE drivers 44 using a packet protocol.IOP Monitor 48 is located on IOP boot disk 26 andis run by IOP 12.IOP Monitor 48 controls system 10 and5 performs the actual I/0 requests to produce the resultsthat are transmitted by IOP Driver 46 to CEs 14.system software 40 also includes console software10152025303549 that runs on IOP 12 and permits user control of system10.or synchronize a CE 14.Using console software 49, a user can reset, boot,The user can also set one orboth of CEs 14 to automatically boot (autoboot) and/orautomatically synchronize (autosync) after being reset orupon startup.The ability to control each CE 14 isuseful both during normal operation and for testpurposes.Using console software 49, the user can alsoplace system 10 into either an integrity mode in whichIOP Monitor 48 shuts down both CEs 14 when faced with amiscompare error, a first availability mode in which IOPMonitor 48 disables CE 14a when faced with a miscompareerror, or a second availability mode in which IOP Monitor48 disables CE 14b when faced with a miscompare error.Finally, console software 49 allows the user to requestthe status of system 10.In an alternative embodiment,console software 49 is implemented using a separateprocessor that communicates with IOP 12.Each CE 14 runs a copy of the same application andthe same operating system as that run by the other CE 14.Moreover, the contents of memory systems 34a and 34b arethe same, and the operating context of CEs 14 are thesame at each synchronization time.Thus, IOP Monitor 48should receive identical sequences of I/O requests fromCES 14.As shown in 
Fig. 3, IOP Monitor 48 processes andmonitors I/O requests according to a procedure 100.Initially,IOP Monitor 48 waits for an I/O request from101520253035CA 02264599 1999-03-05W0 98/12657 PCT/US97/16218.. 11 ..one of CEs 14 (step 102).packet from, for example, CE 14b, IOP Monitor 48 waitsfor either an I/O request from CE 14a or for theUpon receiving an I/O requestBecausesystem 10 uses the DOS operating system, which haltsexecution of an application while an I/O request is beingexpiration of a timeout period (step 104).processed, IOP Monitor 48 is guaranteed not to receive anI/O request from CE 14b while waiting (step 104) for theI/O request from the CE 14a.Next, IOP Monitor 48 checks to determine whetherIf not (thatis, if an I/O request packet from CE 14a has arrived),the timeout period has expired (step 106).IOP Monitor 48 compares the checksums of the packets(step 108). If the checksums are equal, IOP Monitor 48processes the I/O request (step 110). After processingthe I/O request, IOP Monitor 48 issues a request to thesystem BIOS of IOP 12 for the current time of day (step112).After receiving the time of day, IOP Monitor 48assembles an IPI packet that includes the time of day andthe results, if any, of the I/O request (step 114) andsends the IPI packet to IOP Driver 46 (step 116) forWhen CEs 14 receive the IPIpacket, they use the transmitted time of day to updatetransmission to CEs 14.their local clocks which, as already noted, are otherwisedisabled.As required by DOS, execution in CEs 14 issuspended until IOP Monitor 48 returns the results of theI/O request through IOP Driver 46. Before execution isresumed, the times of day of both CEs 14 are updated to acommon value corresponding to the transmitted time of dayfrom the IPI packet.time synchronization.Accordingly, the CEs 14 are kept inThe transmitted time of day isdesignated as the meta time. If a multitasking operatingsystem were employed, execution in CEs 14 would not be101520253035CA 02264599 1999-03-05W0 93/12657 PCT/US97/16218-12-suspended while IOP Monitor 48 performed the I/O request.Instead, processing in CEs 14 would be suspended onlyuntil receipt of an acknowledgement indicating that IOPMonitor 48 had begun processing the I/O request (step110).and would be used by CEs 14 to update the local clocks.After sending the IPI packet to IOP Driver 46, IOPMonitor 48 verifies that both CEs 14 are online (step118), and, if so, waits for an I/O request from one ofthe CEs 14 (step 102).If the timeout period has expired (step 106), IOPMonitor 48 disables the CE 14 that failed to respond(step 119) and processes the I/O request (step 110),The acknowledgement would include the time of dayIf there is a miscompare between the checksums ofthe packets from CEs 14 (step 108), IOP Monitor 48 checksto see if system 10 is operating in an availability modeor an integrity mode (step 120). If system 10 isoperating in an availability mode, IOP Monitor 48disables the appropriate CE 14 based on the selectedavailability mode (step 122), and processes the I/OIOP Monitor 48determines whether the disabled CE 14 has been repairedIf not, IOP Monitor 48 waitsfor an I/0 request from the online CE 14 (step 124).request (steps 110-116). 
Thereafter,and reactivated (step 118).With one of the CEs 14 disabled, system 10 is no longerfault resilient and IOP Monitor 48 immediately processesa received I/O request (step 110).If system 10 is operating in an integrity modewhen a miscompare is detected, IOP Monitor 48 disablesboth CEs 14 (step 126) and stops processing (step 128).Referring again to Figs. 1 and 2, when theapplication or the operating system of, for example, CE14a makes a non-I/O call to the system BIOS, the systemBIOS executes the request and returns the results to theapplication without invoking system software 40.101520253035CA 02264599 1999-03-05W0 93/12557 PCT/U S97/ 16218_ 13 _However, if the application or the operating system makesan I/O BIOS call, CE Driver 44a intercepts the I/OAfter intercepting the I/O request, CE Driver44a packages the I/O request into an IPI packet andtransmits the IPI packet to IOP 12.When IPI module 16a of IOP 12 detects transmissionof an IPI packet from CE 14a,interrupt to IOP Driver 46.IPI packet.As discussed above, IOP Monitor 48 responds to theAsrequest.IPI module 16a generates anIOP Driver 46 then reads theIPI packet from CE 14a according to procedure 100.also discussed, assuming that there are no hardwarefaults, IOP Driver 46 eventually transmits an IPI packetthat contains the results of the I/O request and the timeof day to CEs 14.IPI modules 18 of CEs 14 receive the IPI packetfrom IOP 12. CE Drivers 44 unpack the IPI packet, updatethe time of day of CEs 14, and return control of CEs 14to the application or the operating system running on CEs14.If no I/O requests are issued within a given timeinterval, the IPI module 18 of a CE 14 generates aso-called quantum interrupt that invokes the CE Driver 44of the CE 14.quantum interrupt IPI packet and transmits it to IOP 12.In response, the CE Driver 44 creates aIOP Monitor 48 treats the quantum interrupt IPI packet asThus, IOP Monitor48 detects the incoming quantum interrupt IPI packetan IPI packet without an I/O request.(step 102 of Fig. 3) and, if a matching quantum interruptIPI packet is received from the other CE 14 (steps 104,106, and 108 of Fig. 3), issues a request to the systemBIOS of IOP 12 for the current time of day (step 112 ofFig. 3).of day into a quantum response IPI packet (step 114 ofFig. 3) that IOP Driver 46 then sends to CEs 14 (step 116IOP Monitor 48 then packages the current time101520253035CA 02264599 1999-03-05W0 98/12657 PCT/US97/16218-14..of Fig. 3).response IPI packet by updating the time of day andCE Drivers 44 respond to the quantumreturning control of CEs 14 to the application or theoperating system running on CEs 14.If IOP Monitor 48 does not receive a quantuminterrupt IPI package from the other CE 14 within apredefined timeout period (step 106 of Fig. 3), IOPMonitor 48 responds by disabling the non-responding CE14.As shown in Fig. 1, IPI modules 16, 18 and cables20 provide all of the hardware necessary to produce afault resilient system from the standard Intel 486 based32. An IPImodule 16 and an IPI module 18, which are implementedmotherboards used to implement processors 22,using identical boards, each perform similar functions.As illustrated in Fig. 4, an IPI module 18includes a control logic 50 that communicates I/Orequests and responses between the system bus of aprocessor 32 of a CE 14 and a parallel interface 52 ofIPI module 18.communicates with the parallel interface of an IPI module16 through a cable 20. 
Parallel interface 52 includes asixteen bit data output port 54, a sixteen bit data inputParallel interface 52, in turn,port 56, and a control port 58. Cable 20 is configuredso that data output port 54 is connected to the datainput port of the IPI module 16, data input port 56 isconnected to the data output port of the IPI module 16,and control port 58 is connected to the control port ofthe IPI module 16.handshaking protocol between IPI module 18 and the IPImodule 16.Control logic 50 is also connected to an IPI BIOSROM 60. At startup, control logic 50 transfers IPI BIOS42 (Fig. 2), the contents of IPI BIOS ROM 60, toprocessor 32 through the system bus of processor 32.Control port 58 implements a101520253035CA 02264599 1999-03-05W0 98/12657 PCT/US97I16218- 15 _A QI counter 62, also located on IPI module 18,QIcounter 62 includes a clock input 64 that is connected togenerates quantum interrupts as discussed above.the system clock of processor 32 and a gate input 66 thatis connected to control logic 50. Gate input 66 is usedto activate and reset the counter value of QI counter 62.When activated, QI counter 62 decrements the countervalue by one during each cycle of the system clock ofprocessor 32. When the counter value reaches zero, QIcounter 62 generates a quantum interrupt that, asdiscussed above, activates CE Driver 44 (Fig. 2).CE Driver 44 deactivates QI counter 62 at theCE Driver 44deactivates QI counter 62 by requesting an I/O write at abeginning of each I/O transaction.first address, known as the Q1 deactivation address.Control logic 50 detects the I/O write request anddeactivates QI counter 62 through gate input 66. Becausethis particular I/O write is for control purposes only,control logic 50 does not pass the I/O write to parallelinterface 52. At the conclusion of each I/0 transaction,CE Driver 44 resets and activates QI counter 62 byrequesting an I/O write to a second address, known as theQI activation address. Control logic 50 responds byresetting and activating QI counter 62.In an alternative approach, quantum interrupts aregenerated through use of debugging or other featuresavailable in processor 32. some commonly availableprocessors include debugging or trap instructions thattrap errors by transferring control of the processor to adesignated program after the completion of a selectednumber of instructions following the trap instruction.In this approach, each time that CE Driver 44 returnscontrol of processor 32 to the application or operatingsystem, CE Driver 44 issues a trap instruction toindicate that control of processor 32 should be given to101520253035CA 02264599 1999-03-05W0 93/12657 PCT/US97/16218- 15 _CE Driver 44 upon completion of, for example, 300instructions. After processor 32 completes the indicated300 instructions, the trap instruction causes control ofIn theevent that an I/O request activates CE Driver 44 prior tocompletion of the indicated number of instructions, CEprocessor 32 to be returned to CE Driver 44.Driver 44 issues an instruction that cancels the trapinstruction.IPI Module 18 is also used in activating anoffline CE 14. 
As discussed below, before an offline CE14 is activated, the contents of the memory system 34 ofthe active CE 14 are copied into the memory system 34 ofthe offline CE 14.copying on the active CE 14, the processor 32 of theTo minimize the effects of thisactive CE 14 is permitted to continue processing and thememory is copied only during cycles in which the systembus of the processor 32 of the active CE 14 is not inuse.To enable processor 32 to continue processingwhile the memory is being copied, IPI module 18 accountsfor memory writes by the processor 32 to addresses thathave already been copied to the offline CE 14. To do so,control logic 50 monitors the system bus and, when theprocessor 32 writes to a memory address that has alreadybeen copied, stores the address in a FIFO 68. When thememory transfer is complete, or when FIFO 68 is full, thecontents of memory locations associated with the memoryaddresses stored in FIFO 68 are copied to the offline CE14 and FIFO 68 is emptied. FIFO 68is modified to store both memory addresses and theIn other approaches,contents of memory locations associated with theaddresses, or to store the block addresses of memoryblocks to which memory addresses being written belong.IPI module 18 also handles non-BIOS I/O requests.In some computer systems, the BIOS is too slow to101520253035CA 02264599 1999-03-05W0 98/ 12657 PCTIU S97! 16218_ 17 _effectively perform I/O operations such as video display.As a result, some less structured or less disciplinedoperating systems, such as DOS or UNIX, allowapplications to circumvent the BIOS and make non-BIOS I/Orequests by directly reading from or writing to theThese non-BIOSI/O requests, which cannot be intercepted by changing thesystem interrupt table, as is done in connection with,addresses associated with I/O devices.for example, I/0 disk reads and writes, are problematicfor a system in which synchronization requires tightcontrol of the I/O interface.To remedy this problem, and to assure that evennon—BIOS I/O requests can be isolated and managed by IOP12, IPI module 18 includes virtual I/O devices that mimicThesevirtual I/O devices include a virtual display 70 and avirtual keyboard 72. As needed, other virtual I/Odevices such as a virtual mouse or virtual serial andthe hardware interfaces of physical I/O devices.parallel ports could also be used.In practice, control logic 50 monitors the systembus for read or write operations directed to addressesassociated with non-BIOS I/0 requests to system I/Odevices. When control logic 50 detects such anoperation, control logic 50 stores the informationnecessary to reconstruct the operation in the appropriatevirtual device. Thus, for example, when control logic 50detects a write operation directed to an addressassociated with the display, control logic 50 stores theinformation necessary to reconstruct the operation inEach time that a BIOS I/O request ora quantum interrupt occurs, CE Driver 44 scans thevirtual display 70.virtual I/O devices and, if the virtual devices are notempty, assembles the information stored in the virtualdevices into an IPI packet and transmits the IPI packetto IOP 12. IOP 12 treats the packet like a BIOS I/O101520253035CA 02264599 1999-03-05W0 98/12657 PCT/US97/16218_ 13 -request using procedure 100 discussed above. Whencontrol logic 50 detects a read addressed to a virtualI/0 device, control logic 50 assembles the read requestinto an IPI packet for handling by IOP 12. IOP 12 treatsthe IPI packet like a standard BIOS I/0 request.Referring to Fig. 
Referring to Fig. 5, each CE 14 always operates in one of eight states. Because there are only a limited number of permissible state combinations, system 10 always operates in one of fourteen states. The major CE operating states are OFFLINE, RTB (ready to boot), BOOTING, ACTIVE, RTS (ready to sync), WAITING, M_SYNC (synchronizing as master), and S_SYNC (synchronizing as slave). IOP Monitor 48 changes the operating states of CEs 14 based on the state of system 10 and user commands from console software 49. Through console software 49, a user can reset a CE 14 at any time. Whenever the user resets a CE 14, or a fault occurs in the CE 14, IOP Monitor 48 changes the state of the CE 14 to OFFLINE.

At startup, system 10 is operating with both CEs 14 OFFLINE (state 150). System 10 operates in the upper states of Fig. 5 (states 152-162) when CE 14a becomes operational before CE 14b and in the lower states (states 166-176) when CE 14b is the first to become operational. If CEs 14 become operational simultaneously, the first operational CE 14 to be recognized by IOP Monitor 48 is treated as the first to become operational.

When a CE 14 indicates that it is ready to boot by issuing a boot request, the state of the CE 14 changes to RTB if the CE 14 is not set to autoboot or to BOOTING if the CE 14 is set to autoboot. For example, if CE 14a issues a boot request when both CEs 14 are OFFLINE, and CE 14a is not set to autoboot, then the state of CE 14a changes to RTB (state 152). Thereafter, IOP Monitor 48 waits for the user, through console software 49, to boot CE 14a. When the user boots CE 14a, the state of CE 14a changes to BOOTING (state 154). If the user resets CE 14a, the state of CE 14a changes to OFFLINE (state 150).
Instead, IOP Monitor 48101520253035CA 02264599 1999-03-05W0 98Il2657 PCT/US97/16218-20-makes this state change upon receipt of an IPI packetfrom CE 14a and performs the memory transfer as part ofthe synchronization process.Similar state transitions occur when CE 14b is thefirst CE 14 to issue a boot request. Thus, assuming thatCE 14b is not set to autoboot, CE 14b transitions fromOFFLINE (state 150) to RTC (state 166) to BOOTING (state168) to ACTIVE (state 170). Similarly, once CE 14b isACTIVE, and assuming that CE 14a is not set to autosync,CE 14a transitions from OFFLINE (state 170) to RTS (state172) to WAITING (state 174) to S_SYNC (state 176) toACTIVE (state 164).In other embodiments of the invention, forexample, referring to Fig. 6, a fault resilient system200 includes two IOPs 202 and two CEs 204. Each CE 204is connected, through an IPI card 206 and a cable 208, toan IPI card 210 of each IOP 202. IOPs 202 areredundantly connected to each other through IPI cards 210and cables 212.has a redundant backup component, system 200 is entirelyBecause every component of system 200fault resilient. In an alternative approach, cables 208and 210 could be replaced by a pair of local areanetworks to which each IOP 202 and CE 204 would beconnected. Indeed, local area networks can always besubstituted for cable connections.System 200 is operating system and applicationsoftware independent in that it does not requiremodifications of the operating system or the applicationsoftware to operate. Any single piece of hardware can beupgraded or repaired in system 200 with no serviceinterruption. Therefore, by sequentially replacing eachpiece of hardware and allowing system 200 toresynchronize after each replacement, the hardware ofsystem 200 can be replaced in its entirety withoutservice interruption. Similarly, software on system 200101520253035CA 02264599 1999-03-05W0 98/12657 PCT/US97/16218-21..can be upgraded with minimal service interruption (thatis, during the software upgrade, the application willbecome unavailable for an acceptable period of time suchas two seconds). Also, disaster tolerance for purposesof availability can be obtained by placing each IOP/CEpair in a separate location and connecting the pairsthrough a communications link.Referring to Fig. 7, a distributed, highperformance, fault resilient system 220 includes twosystems 200, the IOPs 202 of which are connected to eachother, through IPI modules, by cables 222. System 220uses distributed computing environment software toachieve high performance by running separate portions ofSystem 220 is faulttolerant and offers the ability to perform both hardwarean application on each system 200.and software upgrades without service interruption.Referring to Fig. 8, a fault tolerant system 230includes three IOPs (232, 234, and 236) and three CEs(238, 240, and 242). Through IPI modules 244 and cables246, each IOP is connected to an IPI module 244 of eachof the other IOPs. Through IPI modules 248 and cables250, each CE is connected to an IPI module 244 of two ofthe IOPs, with CE 238 being connected to IOPs 232 and234, CE 240 being connected to IOPs 232 and 236, and CE242 being connected to IOPs 234 and 236.200, system 230 allows for hardware upgrades withoutLike systemservice interruption and software upgrades with onlyminimal service interruption.As can be seen from a comparison of Figs. 7 and 8,the CEs and IOPs of systems 200 and 230 are identicallyconfigured. 
As a result, upgrading a fault resilientsystem 200 to a fault tolerant system 230 does notrequire any replacement of existing hardware and entailsthe simple procedure of adding an additional CE/IOP pair,connecting the cables, and making appropriate changes to1015202530CA 02264599 1999-03-05W0 98112657 PCT/US97/16218-22-the system software. This modularity is an importantfeature of the paired modular redundant architecture ofthe invention.Because the components of system 230 are triplyredundant, system 230 is more capable of identifying theThus,while system 10 simply disables one or both of CEs 14source of a hardware fault than is system 10.when an error is detected, system 230 offers a higherdegree of fault diagnosis.In one approach to fault diagnosis, as shown inFig. 9, each IOP (232, 234, 236) of system 230 performsfault diagnosis according to a procedure 300.each IOP (232, 234,power loss, broken cables, and nonfunctional CEs or IOPsInitially,236) checks for major faults such asusing well known techniques such as power sensing, cablesensing, and protocol timeouts (step 302). When such afault is detected, each IOP disables the faulty deviceor, if necessary, the entire system.After checking for major faults, each IOP waits toreceive IPI packets (that is, quantum interrupts or I/0requests) from the two CEs to which the IOP is connected(step 304). Thus, IOP 232 waits to receiveIPI packets from CEs 238 and 240.packets from both connected CEs, each IOP transmits thechecksums ("CRCs") of those IPI packets to the other twoIOPs and waits for receipt of CRCs from the other twoIOPs (step 305).After receiving the CRCs from the other two IOPs,for example,After receiving IPIeach IOP may generate a three by three matrix in whicheach column corresponds to a CE, each row corresponds toan IOP, and each entry is the CRC received from the101520253035CA 02264599 1999-03-05W0 98/12657 PCT/US97/16218-23-column’s CE by the row's IOP (step 308). Thus,example, IOP 232 generates the following matrix:forCE 238 CE 240 CE 242IOP 232 I CRC CRC XIOP 234 § CRC X CRCIOP 236 } X CRC CRCAfter generating the matrix, IOP 232 sums the entries inIf the three rowsums are equal and the three column sums are equal (step310), then there is no fault and IOP 232 checks again formajor faults (step 302).each row and each column of the matrix.If either the three rows’ sums or the threecolumns’ sums are unequal (step 310), then IOP 232compares the CRC entries in each of the columns of thematrix. If the two CRC entries in each column match(step 312), then IOP 232 diagnoses that a CE failure hasoccurred and disables the CE corresponding to the columnfor which the sum does not equal the sums of the othercolumns (step 314).If the CRC entries in one or more of the matrixcolumns do not match (step 312), then IOP 232 determineshow many of the columns include mismatched entries. Ifthe matrix includes only one column with mismatchedentries (step 315), then IOP 232 diagnoses that the pathbetween the IOP corresponding to the matrix row sum thatis unequal to the other matrix row sums and the CEcorresponding to the column having mismatched entries hasfailed and disables that path (step 316). 
For purposesof the diagnosis, the path includes the IPI module 244 inthe IOP, the IPI module 248 in the CE, and the cable 250.If the matrix includes more than one column withmismatched entries (step 314), then IOP 232 confirms thatone matrix row sum is unequal to the other matrix rowsums, diagnoses an IOP failure, and disables the IOPcorresponding to the matrix row sum that is unequal tothe other matrix row sums (step 318).101520253035CA 02264599 1999-03-05WO 98/12657 PCT/US97/16218_ 24 _If, after diagnosing and accounting for a CEfailure (step 314), path failure (step 316), or IOPfailure (step 318), IOP 232 determines that system 300still includes sufficient non-faulty hardware to remainoperational, IOP 232 checks again for major faults (step302).can continue to operate even after several componentsBecause system 230 is triply redundant, system 230have failed. For example, to remain operating in anavailability mode, system 230 only needs to have a singlefunctional CE, a single functional IOP, and a functionalpath between the two.Using procedure 300, each IOP (232, 234, 236) cancorrectly diagnose any single failure in a fullyoperational system 230 or in a system 230 in which oneelement (that is, a CE, an IOP, or a path) has previouslybeen disabled. In a system 230 in which an element hasbeen disabled, each IOP accounts for CRCs that are notreceived because of the disabled element by using valuesthat appear to be correct in comparison to actuallyreceived CRCs.Procedure 300 is not dependent on the particulararrangement of interconnections between the CEs and IOPs.To operate properly, procedure 300 only requires that theoutput of each CE be directly monitored by at least twoIOPs.system using any interconnect mechanism and does notThus, procedure 300 could be implemented in arequire point to point connections between the CEs andIOPS.to at least two local area networks.For example, the CEs and IOPs could be connectedIn an alternativeapproach, instead of summing the CRC values in the rowsand columns of the matrix, these values can be comparedand those rows or columns in which the entries do notmatch can be marked with a match/mismatch indicator.A simplified version of procedure 300 can beimplemented for use in a system 200. In this procedure,101520253035CA 02264599 1999-03-05W0 98/12657 PCT/US97/l6218.. 25 —each IOP 202 of system 200 generates a two by two matrixin which each column corresponds to a CE 204 and each rowcorresponds to a IOP 202:CEVQO4 CE 2 4IOP 202 } CRC CRCIOP 202 } CRC CRCAfter generating the matrix, each IOP 202 attaches amismatch indicator to each row or column in which the twoentries are mismatched.If there are no mismatch indicators, then system200 is operating correctly.If neither row and both columns have mismatchindicators, then an IOP 202 has faulted.the operating mode of system 200, an IOP 202 eitherDepending onTheIOP 202 to be disabled is selected based on user supplieddisables another IOP 202 or shuts down system 200.parameters similar to the two availability modes used insystem 10.If both rows and neither column have mismatchindicators, then a CE 204 has faulted. 
In this case,IOPs 202 respond by disabling a CE 204 if system 200 isoperating in an availability mode or, if system 200 isoperating in an integrity mode, shutting down system 200.If both rows and one column have mismatch indicators,then one of the paths between the IOPs 202 and the CE 204corresponding to the mismatched column has failed.202shutDepending on the operating mode of system 200, IOPseither disable the CE 204 having the failed path ordown system 200. If both rows and both column havemismatch indicators, then multiple faults exist and IOPs202 shut down system 200.If one row and both columns have mismatchindicators, then the IOP 202 corresponding to themismatched row has faulted. Depending on the operatingmode of system 200, the other IOP 202 either disables thefaulty IOP 202 or shuts down system 200. If one row and1015202530CA 02264599 1999-03-05W0 93/12557 PCTIUS97/16218-26-one column have mismatch indicators, then the pathbetween the IOP 202 corresponding to the mismatched rowand the CE 204 corresponding to the mismatched column hasfailed.IOPs 202 either account for the failed path in futureprocessing or shut down system 200.Depending on the operating mode of system 200,Referring to Fig. 10, one embodiment of a disastertolerant system 260 includes two fault tolerant systems230 located in remote locations and connected bycommunications link 262, such as Ethernet or fiber, andoperating in meta time lockstep with each other. Toobtain meta time lockstep, all IPI packets aretransmitted between fault tolerant systems 230. Likesystem 220, system 260 allows for hardware and softwareupgrades without service interruption.As shown, the paired modular redundantarchitecture of the invention allows for varying levelsof fault resilience and fault tolerance through use ofCEs that operate asynchronously in real time and arecontrolled by IOPs to operate synchronously in meta time.This architecture is simple and cost-effective, and canbe expanded or upgraded with minimal difficulty.Fig. 11 depicts the components of the systemsoftware architecture of an alternative embodiment of theinvention that includes multiple CEs and multiple IoPs.Each CE 1100 includes a set of one or more PhysicalDevice Redirectors 1105, a CE Transport layer 1110, andan IPI Driver 1115.intercept I/0 requests directed to peripheral devices,The Physical Device Redirectors 1105package the requests, and send the requests to the CETransport 1110. The Physical Device Redirectors 1105also receive responses to the requests from the CETransport 1110, unpackage the responses, and return theresponses to the operating system or applications101520253035CA 02264599 1999-03-05W0 98/12657 PCTlUS97/16218-27-software that initially made the I/O requests to theperipheral devices.The system software architecture depicted in Fig.Insuch an environment, each thread is a separate stream of11 supports a multi-threaded processing environment.instructions that may be processed by the computingelement. When a Physical Device Redirector 1105intercepts an I/O request by a particular thread,processing of that thread stops until the Physical DeviceRedirector 1105 returns the response to the request.The CE Transport 1110 communicates I/0 requestsbetween the CEs and the IOPs. The CE Transport alsokeeps track of responses expected and received from theIOPs on a per request basis. 
The CE Transport searches for completed requests (i.e., requests to which all IOPs have responded) as IOP responses to redirected requests are received and sends the resultant data to the Physical Device Redirector 1105.

Each IOP 1150 includes an IPI driver 1155, a state control program (SCP) 1160, a fault handler 1165, an IOP Transport layer 1170, a Device Synchronization Layer (DSL) 1175, and a set of one or more Physical Device Providers 1180. The IPI Drivers 1115, 1155 of both the CEs and the IOPs control the actual transmission and reception of data along the interconnect paths 1185 between the CEs and the IOPs. The state control program 1160 initiates and responds to state transitions involving the IOPs and the CEs. The fault handler 1165 responds to reports of detected faults by deconfiguring appropriate hardware components. The IOP Transport 1170 transports data between the IPI Driver and the other software components of the IOPs. The Device Synchronization Layer 1175 is responsible for synchronizing all redirected requests and responses between IOP-based peripherals. This synchronization includes the reprocessing and restructuring of the requests and responses due to state transitions of the IOPs, the CEs, or the peripheral devices. The Physical Device Providers 1180 unpackage requests received from the DSL 1175, process the requests, package responses, and send the responses to the DSL 1175. Each IOP also may include an Ethernet driver 1190 or other mechanism that permits direct communication between the IOPs.

As noted above, the architecture of the described embodiments requires redirection of all I/O requests from the CEs to one or more IOPs. Responses to these redirected requests must be synchronized. Accordingly, all IOPs responding to a redirected request must respond to the request in a known way and must describe how all other IOPs will respond. Moreover, an IOP must respond to a redirected request even if the IOP does not have a device capable of processing the request. These requirements permit software running on the CEs to compare the responses of the IOPs for consistency.

Figs. 12A and 12B provide an illustrative example of appropriate IOP responses (Fig. 12B) to requests by a CE (Fig. 12A). As shown, each disk of a two-disk shadow set is associated with a different IOP, with the disk 1200 associated with the first IOP 1205 functioning normally and the disk 1210 associated with the second IOP 1215 being offline. A request from the CE 1100 to the shadow set for data is redirected to the IOPs, and both IOPs respond. (The second IOP 1215 receives all requests directed to the shadow set even though its associated disk 1210 is offline.) Each IOP indicates that the first IOP 1205 is responding with the expected data and that the second IOP 1215 is responding with no data.

The CE Transport 1110 (Fig. 11) monitors the responses provided by the IOPs. The CE Transport compares the responses for consistency once all IOPs have responded to a request. The responses must be consistent in indicating which IOPs responded to the request with actual data and which IOPs did not. Moreover, if more than one IOP responds with data, then the data provided by the IOPs must be consistent.
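The consistency rule just described can be expressed compactly. The following is a minimal sketch, assuming a hypothetical response record that carries the responding IOP's data (or None for a no-data response) together with its prediction of which IOPs will supply data; none of these names come from the patent.

    # Sketch of the CE Transport consistency check described above.
    # Each response is (iop_id, data_or_none, predicted_data_iops), where
    # predicted_data_iops is the set of IOPs this responder says will
    # return actual data.

    def responses_consistent(responses):
        # All responders must agree on which IOPs supply actual data.
        predictions = {frozenset(p) for (_, _, p) in responses}
        if len(predictions) != 1:
            return False
        predicted = predictions.pop()
        actual = {iop for (iop, data, _) in responses if data is not None}
        if actual != predicted:
            return False
        # If more than one IOP supplied data, the data must match.
        data_values = [data for (_, data, _) in responses if data is not None]
        return all(d == data_values[0] for d in data_values)

    # Shadow set example: IOP 1205 returns data, IOP 1215 returns no data,
    # and both predict that only IOP 1205 will supply data.
    print(responses_consistent([(1205, b"block", {1205}),
                                (1215, None, {1205})]))  # True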
If the responses pass the consistency checks, then the CE Transport provides a response containing actual data to the appropriate Physical Device Redirector 1105 in the CE 1100 and discards the no-data response. Thereafter, the CE Transport informs the IOPs that all IOPs have responded to the request and further informs the IOPs of the outcome of the comparison. If the responses do not pass the consistency checks, then one of the IOPs is disabled and, in effect, removed from the architecture.

The CE Transport 1110 implements the request-response architecture employed by the system of Fig. 11 using the procedure 1300 illustrated in Fig. 13. The CE Transport first instructs the IPI Driver 1115 to send a request from a Physical Device Redirector 1105 to all IOPs 1150 (step 1305). At the same time, the CE Transport 1110 initializes a timer. If the timer has not expired (step 1310) before all responses are received (step 1315), then the CE Transport compares the responses to determine whether they are consistent (step 1320). If the responses are consistent and there are multiple responses that include data (step 1325), then the CE Transport compares the data responses (step 1330). If the data responses are consistent, or if there is only a single data response, then the CE Transport sends the data to the appropriate Physical Device Redirector 1105 (step 1335). Finally, the CE Transport instructs the IPI driver to send a Response Complete message to all of the IOPs (step 1340).

If the timer expires (step 1310) before all responses are received, or if inconsistent responses are received (steps 1320, 1330), then the CE Transport reports the occurrence of an error (step 1345) and instructs the IPI driver to send a Response Complete message to all of the IOPs (step 1340).

In the IOPs, the Device Synchronization Layer (DSL) 1175 makes all IOP-based physical devices appear to the CEs as logical or virtual devices. In addition, the DSL combines device and state information from all IOPs and uses this information to project a single logical IOP to the Physical Device Redirectors of the CEs. Thus, a system configured as illustrated in Fig. 14A would appear to the CE-based Physical Device Redirectors as having the logical representation illustrated in Fig. 14B.

The DSL 1175 represents devices logically so that the CE-based redirectors have no knowledge of the locations and characteristics of physical devices. For example, a SCSI redirector would not know that Disk C: (1400, 1405) and Disk D: (1410, 1415) are shadowed drives and would simply treat them as though they were each a single drive (1420, 1425). Also, an Ethernet redirector would not know that multiple Ethernet controllers provide a primary controller (1430) and a secondary controller (1435) and instead would be aware only of a single Ethernet controller (1440).

The DSL 1175 receives all redirected requests from the IOP Transport 1170. The DSL then decides whether to send a request to a Physical Device Provider 1180 or to send a no-data response. A critical function of the DSL is that it only sends a request to a device provider when that provider is expected to process the request. For example, if two IOPs control a shadow set of disks, with the first IOP controlling an offline disk and the second IOP controlling an active disk, the DSL of the first IOP would not send a request to its disk, and would instead respond with a no-data response. The DSL on the second IOP would send the request to its disk, and would eventually respond with the data from its disk along with an indication that the first IOP would be providing a no-data response.
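As a minimal sketch of that routing decision, assume a hypothetical device state table mapping each (device, IOP) pair to an "active" or "offline" status; the table and function names are illustrative only, not part of the described system.

    # Sketch of the DSL decision described above: forward a request to the
    # local Physical Device Provider only if the local copy of the device
    # is expected to process it; otherwise answer with a no-data response.

    DEVICE_STATE = {
        ("disk_shadow_set", "IOP_A"): "offline",
        ("disk_shadow_set", "IOP_B"): "active",
    }

    def route_request(device, local_iop, all_iops):
        # Predict which IOPs will supply data, based on the state table.
        data_iops = {iop for iop in all_iops
                     if DEVICE_STATE.get((device, iop)) == "active"}
        if local_iop in data_iops:
            action = "send request to local device provider"
        else:
            action = "respond with no-data"
        return action, data_iops  # the prediction accompanies every response

    print(route_request("disk_shadow_set", "IOP_A", ["IOP_A", "IOP_B"]))
    # ('respond with no-data', {'IOP_B'})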
With every response, the DSL 1175 indicates how every other IOP will respond. The DSL accomplishes this by maintaining a record of the status of every other IOP and associated device in an internal device state table.

The DSL 1175 also accounts for the effects of state changes in the CEs, the IOPs or the peripheral devices. For example, the DSL accounts for the situation that arises when the state of a peripheral device changes after an IOP indicates how the other IOPs will respond to a request, but before the other IOPs actually respond to the request. If the other IOPs were to respond after the state change, their responses could differ from the response expected by the first IOP. The DSL solves this problem by disabling all response transmissions from the affected device and initiating a flush sequence that causes the CE Transport to discard all previously received, incomplete responses for the particular device. (As previously noted, the CE Transport does not consider a request to be complete until responses to the request have been received from all IOPs.)

Upon receipt of a flush completion indication from the CE Transport 1110, the DSL 1175 of each IOP 1150 updates its internal device state table and reenables transmission of responses for the particular device. Finally, the DSL reprocesses any incomplete responses that were submitted to the CE Transport prior to the flush.

As noted above, the CE Transport 1110 sends an indication to the DSL 1175 of each IOP upon completion of each response. As such, the DSL always knows which responses are complete. This permits the DSL to keep track of incomplete responses. When a device state change occurs and an ensuing flush is issued, the DSL can, upon receipt of the flush complete indication, reset its internal knowledge of the device state and re-issue any affected incomplete requests. This functionality is critical, for example, to the implementation of standby-primary processing since the DSL may have responded to a particular request on behalf of a standby device with a no-data response. If the primary device failed and was unable to process the request, the DSL would initiate the flush sequence and re-issue the request. This time the standby device would be considered the primary device (since the primary device had failed) and would receive the request.

The internal state table of the DSL 1175 keeps track of all flushes that can affect the I/O devices and the IOPs. The DSL, through use of the state table, permits multiple flushes to affect one or more devices, and permits multiple state transitions to occur at a single time. This permits seamless handling of multiple related or unrelated component state changes.

The DSL 1175 also mandates that all device state changes originate from the IOP 1150 that owns the device. This permits the DSLs of different IOPs to have a different simultaneous idea of the state of a device without risk that inconsistency problems will occur at the CE Transport 1110. This also permits the DSL to be fully distributed since there is no need to freeze request or response queues while a master software entity determines whether a steady state operation has been achieved.
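The standby-primary recovery just described can be sketched as a small state machine. The sketch below assumes hypothetical names for the state record and the flush hooks; it only illustrates the order of operations (disable, flush, update, re-issue) given in the text.

    # Sketch of DSL recovery after a device state change, per the text:
    # suspend the device, flush incomplete responses at the CEs, update
    # the internal device state table, then re-issue affected requests.

    class DslDeviceRecord:
        def __init__(self, state):
            self.state = state            # e.g. "primary" or "standby"
            self.suspended = False
            self.incomplete = []          # requests awaiting completion

    def on_device_state_change(record, new_state, send_flush, reissue):
        record.suspended = True           # disable response transmissions
        send_flush()                      # CEs discard incomplete responses
        # ... CE Transport signals flush completion here ...
        record.state = new_state          # update internal device state table
        record.suspended = False          # reenable transmissions
        for request in record.incomplete:
            reissue(request)              # re-issue affected requests

    rec = DslDeviceRecord("standby")
    rec.incomplete.append("read block 7")
    on_device_state_change(rec, "primary",
                           send_flush=lambda: print("flush sent"),
                           reissue=lambda r: print("re-issued:", r))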
In most circumstances, the Physical Device Providers 1180 need not consider the state of a peripheral device or an IOP because the DSL 1175 only sends a request to a device provider when the device provider is expected to process the request. Similarly, the DSL does not consider the I/O policy associated with a particular device. For example, the DSL does not consider whether a disk device has a shadowed or a single-ended I/O policy. However, the DSL does use I/O policies to determine which providers on which IOPs will receive a particular request to process. This permits the DSL to arbitrarily treat any device as shadowed, single-ended, virtual, or primary/standby even though all combinations of I/O policies and device types do not necessarily make sense.

The DSL 1175 handles all device state transitions, including device failure, device activation, device merge, and manual device enable/disable. In addition, the DSL transparently handles all IOP state transitions, including IOP joining, IOP activation, IOP removal, and IOP graceful shutdown, as these transitions relate to device states. The DSL also responds automatically to requests that cannot be satisfied on behalf of the device providers. The DSL provides a full featured application program interface (API) set that is useable by developers.

The DSL 1175 provides automatic request timeout support. In this regard, the DSL starts a recovery process if a CE-originated request does not complete within a specified period of time. During the recovery process, the DSL determines which IOP 1150 has stalled and notifies the fault handler.

The DSL 1175 can apply any I/O policy to any physical device. For example, the DSL can configure a hard disk as a single-ended device instead of a shadowed device. Similarly, the DSL can configure, for example, a CD-ROM player or a serial port as a standby/active device instead of as a single-ended device.

As discussed above, the DSL 1175 initiates a flush sequence in the CE Transport 1110 of one or more CEs 1100 in response to a state change in a peripheral device or an IOP 1150. The flush sequence causes the CE Transport to flush all outstanding activity in the messaging pipeline of the corresponding CE with respect to indicated devices or Physical Device Providers 1180 and to provide notification of completed requests as a result of the flushing.

The flush mechanism provides varying granularity of request-response synchronization as required by the DSL. Thus, a system-wide flush can be implemented for all devices (or Physical Device Providers), or a flush can be implemented for a class of devices or a specific device.

Upon completion of the flush sequence, the DSL of each IOP knows exactly which request-response pairs have been processed and completed. The DSL uses this knowledge to reissue, re-execute or re-transmit any necessary request-response pairs to permit recovery from the transition (or stimulus) within the system that disrupted the steady state operation of one or more devices.
This allows the DSL to react to changes in the system that affect steady state operation of the devices. Once a flush sequence is initiated and completed, the DSL can determine exactly which requests or responses must be reprocessed or redirected to other devices for completion.

Each CE Transport 1110 maintains a database of outstanding requests. The database includes a list of all outstanding I/O requests in the system, each identified by a unique identifier called an XRN (Transport Reference Number). The flush sequence is carried out according to the procedure 1500 illustrated in Fig. 15. First, the DSL 1175 of each IOP 1150 initiates a flush sequence by sending a flush request to all of the CEs 1100 (step 1505). The DSL then suspends all request/response processing activity for the devices that are involved in the flush sequence until the flush sequence is completed (step 1510). Each CE receives a flush request and waits for matching flush requests from every IOP in the system (step 1515). The CE enters the flush request into the database and tracks the flush request in the same way that the CE tracks operating system requests. When a flush request has been received from all IOPs, the CE indicates that the first portion of the flush sequence is completed. The completion of this portion of the flush sequence signifies that all activity that was in the IOP-to-CE message pipelines has been flushed out and processed by the CEs.

Before acknowledging completion of the flush sequence, the CE first sends acknowledgments to the IOPs as to which outstanding requests have been completed (step 1520). The CE sends a SWTACK (Software Transaction Acknowledgment) for each request that was completed and removes the completed request from the database. Each SWTACK contains the request's original XRN. The XRN allows the IOPs to associate the SWTACK with the proper completed request.

After sending a SWTACK for each completed request, the CE sends a SWTACK for the flush sequence (step 1530). Messages are delivered in order through the CE-to-IOP message pipelines. In some cases, I/O requests may be completed by flushing responses through the IOP-to-CE message pipelines (or are reordered to reflect their transmission sequence). Accordingly, the flush SWTACK serves to flush the request completion notification SWTACKs through the CE-to-IOP message pipelines. Thus, when the DSL receives the SWTACK for the flush sequence (step 1535), the DSL has already received and processed all SWTACKs for requests that were originally in the message pipelines and have since completed (step 1525). Upon receiving the SWTACK for the flush sequence, the DSL knows the state of all request/response activity in the system. The DSL responds to this information by updating the state of the system (step 1540). Thereafter, the DSL resumes request-response activity for affected devices (step 1545), and resends any incomplete affected requests (step 1550). This re-synchronizes the devices against the new state to achieve a steady state of operation for the device or devices involved in the flush sequence.

Since each flush sequence is uniquely tagged with its own XRN, more than one flush can be in progress at a time. The flush processing software in the CE Transport and the DSL abides by certain rules as to how outstanding flush sequences are processed and in what order they are acknowledged. This allows the DSL to preempt or override previously-issued flush sequences due to subsequent or secondary transitions that might occur within the system.
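The CE side of procedure 1500 can be sketched as follows, assuming a hypothetical request database keyed by XRN; the message names mirror the text (a SWTACK per completed request, then a SWTACK for the flush itself), but the implementation details are illustrative.

    # Sketch of the CE Transport's handling of a flush request (Fig. 15):
    # acknowledge every completed outstanding request by XRN, then
    # acknowledge the flush itself so the per-request acknowledgments are
    # pushed ahead of it through the in-order CE-to-IOP pipeline.

    def handle_flush(flush_xrn, outstanding, completed_xrns, send):
        # Step 1520: SWTACK each completed request and drop it from the
        # outstanding-request database.
        for xrn in list(outstanding):
            if xrn in completed_xrns:
                send({"type": "SWTACK", "xrn": xrn})
                del outstanding[xrn]
        # Step 1530: SWTACK the flush sequence; because messages are
        # delivered in order, this flushes the request-completion SWTACKs
        # ahead of it through the CE-to-IOP pipeline.
        send({"type": "SWTACK", "xrn": flush_xrn})

    outstanding = {101: "read", 102: "write", 900: "flush"}
    handle_flush(900, outstanding, completed_xrns={101},
                 send=lambda msg: print("sent", msg))
    # The DSL, on receiving SWTACK 900, knows SWTACK 101 has already arrived.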
The system software implements a freeze protocol to ensure that IOP-to-CE communications will not affect meta time synchronization of the CEs. As previously discussed, the IOPs operate asynchronously to the CEs and to other IOPs due to the asynchrony inherent in I/O devices. For this reason, communication between the CEs and the IOPs needs to occur in a way that will not disturb the meta time synchronization of the CEs. CE-to-IOP communication is synchronous to the CE instruction stream and will not affect CE lockstep as long as sufficient buffering is provided. However, IOP-to-CE communication is by nature asynchronous to the instruction stream of each CE. Accordingly, IOP-to-CE communication, if handled improperly, could affect each CE differently and result in divergence of the CE instruction streams. The freeze protocol serves to delay processing of asynchronous data from the IOPs until the data can be handled synchronously by all CEs. In particular, the freeze protocol serves to delay implementation of the procedure 1300 by the CE Transports 1110 until all of the CE Transports 1110 are ready to implement the procedure 1300.

The freeze protocol provides four primary features: synchronized processing of input data streams from the asynchronous IOPs across the loosely synchronized CEs; synchronized time updates to the CEs; a deep-freeze mechanism that allows an IOP to synchronously hold all CEs in a captive state for an extended period of time; and fault detection/diagnosis with respect to the communication paths between the CEs and the IOPs.

The freeze protocol provides CE synchronization using a so-called freeze cycle. The CE Transport 1110 of a CE may initiate a freeze cycle each time that it is activated by a Physical Device Redirector 1105 to service a redirected I/O operation. However, to prevent the use of excessive bandwidth in performing freeze cycles, implementations of the software may initiate a freeze cycle every time that a certain number of I/O requests occurs or a certain number of instructions are processed without an I/O request. For example, the CE Transport 1110 may implement a freeze cycle with every fifth I/O request or every ten thousand instructions, a pacing rule sketched below.

The CE Transport 1110 initiates a freeze cycle by transmitting a high-priority freeze request message to all active IOPs and waiting for freeze response messages from all active IOPs. Since all of the CEs are processing the same instruction stream, the CE Transport 1110 of each active CE will transmit a freeze request message to all active IOPs. Each IOP receives the freeze request messages from the CEs. When an IOP has received a freeze request message from all active CEs, this indicates that the CEs are at the same point in their instruction streams (i.e., in synchronization) and that it is permissible for the CEs to process received data using the procedure 1300. Accordingly, the IOP responds by sending a freeze response message to all active CEs.
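Returning to the pacing rule mentioned above, a minimal sketch follows; the counters and thresholds are hypothetical (the text gives five I/O requests and ten thousand instructions only as examples). Because every CE executes the same instruction stream, identical counters guarantee that all CEs decide to freeze at the same point.

    # Sketch of the freeze-cycle pacing policy described above: initiate
    # a freeze every Nth I/O request, or after a bounded number of
    # instructions with no I/O request.

    IO_REQUESTS_PER_FREEZE = 5        # example values from the text
    INSTRUCTIONS_PER_FREEZE = 10_000

    class FreezePacer:
        def __init__(self):
            self.io_count = 0
            self.instruction_count = 0

        def on_io_request(self):
            self.io_count += 1
            self.instruction_count = 0
            if self.io_count >= IO_REQUESTS_PER_FREEZE:
                self.io_count = 0
                return True           # initiate a freeze cycle
            return False

        def on_instructions_retired(self, n):
            self.instruction_count += n
            if self.instruction_count >= INSTRUCTIONS_PER_FREEZE:
                self.instruction_count = 0
                return True           # freeze even without I/O activity
            return False

    pacer = FreezePacer()
    print([pacer.on_io_request() for _ in range(5)])
    # [False, False, False, False, True]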
The CEs 1100 receive the freeze response messages from the IOPs and place the messages in the normal-priority message queues of the IPI Driver 1115 to provide an inter-CE synchronization point for the CE Transport 1110. Finally, after receiving freeze response messages from all active IOPs, the CEs terminate the freeze cycle by transmitting freeze release messages to the IOPs.

The CE Transport 1110 invokes the freeze cycle through the IPI Driver 1115. The IPI Driver responds by initiating the freeze cycle (i.e., sending the freeze request message) and returning control to the CE Transport 1110. This permits the CE Transport 1110 to continue processing in parallel with the IPI Driver's handling of the freeze protocol messages. The CE Transport performs whatever useful work it can, such as transmitting normal priority messages, or draining and processing messages from the receive queue of the CE 1100. The CE Transport does not return control to the operating system until the CE Transport has encountered a freeze response message in the normal-priority receive queue corresponding to each active IOP.

The IPI Driver 1115 attempts to complete the freeze cycle as quickly as possible. To this purpose, the IPI Driver composes a priority message and sends the message to all active IOPs 1150. The message carries no data to the IOPs and merely serves as an indication that the freeze cycle has started. The IPI Driver starts a freeze response timeout counter after sending the priority message to detect failure of any IOP to respond. Typically, the length of this counter is on the order of twice the worst-case message transmit time. This provides each IOP with sufficient time to complete transmission of any normal priority message that it may be transmitting upon receipt of the freeze request message before the IOP transmits a freeze response message. Typically, normal priority messages have a size limit of, for example, 64 kilobytes, to ensure that the worst-case message transmit time will have a reasonable value.

The IPI Driver 1155 of each IOP 1150 places a timestamp update in the freeze response message. This timestamp corresponds to the number of clock ticks since the last freeze response message transmitted by the IOP. Thus, a CE 1100 may receive different timestamp updates from different IOPs. Accordingly, the CEs designate one IOP 1150 as a meta-time server to ensure that all CEs will update their local time clock identically.

After receiving a freeze request message from all active CEs, the IPI Driver 1155 of the IOP 1150 sends the freeze response message as a microcode-formatted priority message. Upon sending the message, the IOP initiates a freeze release timeout counter to detect failure of a CE to respond to the freeze response.

The IPI Driver 1115 of each CE is interrupted to service the incoming freeze response message, and responds by reformatting the message into a normal-priority message and placing the message at the bottom of the normal-priority message queue. Placement of the message in the message queue completes the time-critical portion of the freeze cycle, since it permits the CE Transport 1110 to return control to the operating system.
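The timestamp handling above can be sketched as follows; the tick bookkeeping and the choice of a designated time-provider IOP follow the text, while the class and function names are hypothetical.

    # Sketch of meta-time updates carried by freeze responses: each IOP
    # reports the ticks elapsed since its previous freeze response, and
    # every CE applies only the delta from the designated time provider.

    class IopClock:
        def __init__(self):
            self.ticks = 0            # advanced by the IOP's timer interrupt
            self.last_reported = 0

        def freeze_response_delta(self):
            delta = self.ticks - self.last_reported
            self.last_reported = self.ticks
            return delta              # placed in the freeze response message

    def apply_time_update(local_clock, responses, time_provider):
        # responses: {iop_id: delta_ticks}; deltas may differ per IOP, so
        # all CEs use the delta from the designated meta-time server.
        return local_clock + responses[time_provider]

    iop = IopClock()
    iop.ticks = 137
    responses = {"IOP_1": iop.freeze_response_delta(), "IOP_2": 140}
    print(apply_time_update(0, responses, time_provider="IOP_1"))  # 137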
Once the CE's IPI Driver 1115 has received and processed the freeze response from the last active IOP, the IPI Driver 1115 broadcasts a freeze release message to the active IOPs. This completes the freeze cycle for the CE 1100. The IOP 1150 receives the release message and cancels the freeze release timer upon receipt of a release message from every active CE 1100.

As noted above, time updates are provided to the operating system by including time increments in the freeze response packet. The IOP IPI Drivers 1155 maintain the time increments using an internal 100 µs timer interrupt. Only one of the IOPs is designated as the time provider, and the time increment from that IOP is used by all of the CEs. The IOP's IPI Driver 1155 transmits the delta time since the last freeze response packet was transmitted as the time increment. The CE Transport 1110 uses this value to update the time-of-day clock of the operating system after a freeze response packet has been processed for all receive queues.

A deep freeze protocol is a variation of the normal freeze protocol and serves to suspend normal activity between the CEs and the IOPs to allow major system state transitions to occur. The deep freeze state is invoked by the IOP software, which uses the IOP Transport 1170 to command the IOP IPI Driver 1155 to replace the next freeze response message with a deep freeze response message. The format of the deep freeze response message is identical to that of the normal freeze response message with the exception that different opcodes are used to designate the two types of messages. The deep freeze response carries a meta-time update like a normal freeze response. In addition, the deep freeze response causes the initiating IOP to disable its transmitter without initiating a timeout counter.

The IOP's IPI Driver 1155 sends the deep freeze response to all active CEs to inform them that a deep freeze state is requested. The CEs respond by converting the deep freeze response to a normal-priority response message and adding the message to the normal-priority message queue, with an indication that the message is a deep freeze response instead of a normal freeze response. The CEs continue to process normal freeze response messages from the other IOPs in the normal manner. After receiving a normal freeze response or a deep freeze response from all active IOPs, the CE's IPI Driver 1115 sends a deep freeze request message to the IOPs that are not yet in the deep freeze state and restarts the freeze response timer.

Receipt of a deep freeze request informs the IOPs that another IOP has injected a deep freeze cycle into the current freeze cycle. (Normally, the IOP IPIs would have received a freeze release message.) Each IOP responds to the freeze request by cancelling the freeze release timeout counter, turning off the IOP's transmitter path, sending to all active CEs a deep freeze response message with a meta-time update since the previous freeze response, and restarting the freeze release timeout counter.

The IPI Drivers 1115 of the CEs receive the deep freeze responses and insert them into the appropriate receive queues.
Upon receipt of a deep freeze response from each active IOP, the CE Transport 1110 cancels the response timer and issues the normal freeze release message to the IOPs.

The CE Transport 1110, in the meantime, has encountered a combination of normal freeze response messages and deep freeze response messages in the various receive queues. Detection of a single deep freeze response causes the CE Transport to process beyond the normal freeze response to the deep freeze response. Hence, the CE IPI Driver 1115 must ensure that both the freeze responses and the deep freeze responses make their way into the receive queues in the proper order.

Only priority messages can be exchanged between components when the system is in the deep freeze state. In addition, no new freeze cycles will be initiated by the CE Transport while the system is in the deep freeze state.

Once the major state transition requiring the deep freeze cycle has completed, the deep freeze state is terminated by initiating a deep freeze termination cycle with the issuance of a deep freeze termination request. The deep freeze termination cycle typically is originated by the IOP Transport 1170 that invoked the deep freeze cycle, but any IOP Transport 1170 can invoke a deep freeze termination cycle with the same result.

The deep freeze termination request is a register-level request to the IPI Driver 1155 that causes the driver to broadcast a deep freeze termination priority message to all active CEs. The IPI Driver 1115 of each CE receives this message and echoes the message to all active IOPs. Upon receiving the deep freeze termination message, the IOPs activate their transmission paths and exit the deep freeze state.

The freeze response timers and the freeze release timers account for the occurrence of errors during the freeze protocol. If a CE's freeze response timer expires, the CE generates a high priority system error (SYSERR) packet containing the CE's freeze status virtual register and sends it to all active IOPs. The IOPs supplement the SYSERR packet with relevant state information of their own and forward the SYSERR packet to the Fault Handlers 1165.

If an IOP's freeze release timer expires, the IOP generates a local SYSERR packet containing only state information from the detecting IOP and passes this SYSERR packet to the Fault Handler 1165. The detecting IOP then sends a similar SYSERR packet to all active CEs. The CEs supplement the SYSERR packet with their own state information and echo the SYSERR packet back to all active IOPs. The IOPs store additional state information in the echoed SYSERR and forward the message to the Fault Handler 1165.

After generating a SYSERR packet, the IPI Driver 1115, 1155 of the CE 1100 or the IOP 1150 waits for the Fault Handler 1165 residing on the IOP 1150 to resolve the error condition. The Fault Handler determines which path is at fault and commands the IOP IPI Driver 1155 to disable the faulty path. After disabling the faulty path, the IOP IPI Driver evaluates the state of the freeze protocol and resumes normal processing once the requirements for such processing are met.
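A minimal sketch of this timeout-driven error path follows, assuming hypothetical message and handler names; it captures only the flow given in the text (timer expiry produces a SYSERR that is supplemented with state information and forwarded to the fault handler, which disables the faulty path).

    # Sketch of freeze protocol error handling: a CE whose freeze
    # response timer expires emits a SYSERR; each IOP appends its own
    # state and hands the packet to its Fault Handler for diagnosis.

    def ce_on_freeze_response_timeout(freeze_status, send_to_iops):
        syserr = {"source": "CE", "freeze_status": freeze_status,
                  "supplements": []}
        send_to_iops(syserr)                 # high-priority packet

    def iop_on_syserr(syserr, iop_state, fault_handler):
        syserr["supplements"].append(iop_state)   # add local state info
        fault_handler(syserr)                # diagnose; may disable a path

    def fault_handler(syserr):
        print("diagnosing fault event:", syserr["freeze_status"])
        print("commanding IPI driver to disable the faulty path")

    ce_on_freeze_response_timeout(
        freeze_status="awaiting response from IOP 2",
        send_to_iops=lambda s: iop_on_syserr(s, {"iop": 1}, fault_handler))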
The IOPs do not apply a timeout for the reception of freeze requests from all active CEs. If one or more CEs fail to send a freeze request message, or if a freeze request message is not received for some other reason, the other CEs will eventually generate a freeze response timeout SYSERR packet.

The CE IPI Driver 1115 state transitions required to service the freeze protocol, including the deep freeze extension, are illustrated in Fig. 16A. The IOP IPI Driver 1155 state transitions required to service the freeze protocol, including the deep freeze extension, are illustrated in Fig. 16B.

The system implements a message based fault notification and reporting environment using the Fault Handler 1165. From an error-processing perspective, the system can be viewed as including error reporting elements and error processing elements. The error reporting elements are any components in the system that detect error conditions or determine status conditions and transmit the information to an error processing element. The error processing elements, or fault handlers, receive error information from the reporting elements.

The collection of error messages that result from a single fault are referred to as a fault event. A fault handler uses the error messages associated with a fault event to identify a particular system component that has failed and caused the fault event. The identified component is referred to as a callout. A fault handler may also take or initiate action to resolve the fault and restore normal, if degraded, system operation.

Error reporting elements may be either hardware or software entities. The only requirement is that they must be capable of transmitting error information to the error processing elements or causing such information to be transmitted.

An error reporting element that detects a system error encapsulates the system error into a uniformly formatted packet referred to as a SYSERR. The error reporting element then sends the SYSERR (or causes the SYSERR to be sent) to all error processing elements that have connectivity to the reporting element. The architecture of the physical system is configured so that any error condition causes generation of at least one SYSERR packet. The physical system is also configured so that, ideally, each error reporting element is connected to every error processing element. This permits the lack of an error indication when one was expected to be used as diagnostic information.

A single SYSERR packet may not unambiguously identify the source of a fault event in many instances. When this situation arises, the fault handlers rely on the diagnostic information provided by SYSERR packets from multiple sources to unambiguously identify the source of the fault.

To provide connectivity between the error reporting elements and the fault handlers, some transport components are capable of echoing SYSERRs generated by a reporting element to other fault handlers in the system that are directly connected to the transport components. Thus, fault handlers that are not directly connected to a reporting element can still obtain error information from that element. For example, as shown in Fig. 17, a fault handler 1700 on an IOP 1705 cannot directly receive a SYSERR packet generated by an IOP 1710. To account for this, the IPI adapter 1715 on the CE 1720 serves as a SYSERR reflector and echoes a SYSERR produced by IOP 1710 to IOP 1705.

A fault handler 1165 may also be able to probe other system components to obtain error information.
In addition, separate fault handlers may communicate with each other to probe the viability of the system components on which they reside, to test the communication paths between the system components, and, assuming that the communication paths are intact, to ensure that each fault handler reaches the same diagnosis in response to a fault event.

A fault handler 1165 groups encountered errors into sets of co-related errors, referred to as syndromes. Syndromes generally indicate a faulty component, or a suspected list of faulty components, with more specificity than individual errors in the syndrome are able to provide.

Each fault handler uses a state table to parse the incoming errors into specific syndromes. Each syndrome represents a state in the table. If possible, the fault handler uses an incoming error to transition the state table to a new state.

The fault handler processes errors according to the procedure 1800 illustrated in Fig. 18. Initially, the fault handler represents each error as a canonical error by converting the error to a normal form that uniquely identifies the error (step 1805). For example, the fault handler might convert the error to a triplet that includes an error identifier that identifies a particular error, an error target that identifies the subcomponent about which the error is complaining, and a reporting source that identifies the subcomponent that reported the error and the path over which the error was received.

The fault handler then processes the canonical error. First, the fault handler compares the error against states represented by previously established syndromes (step 1810) to see if the error will transition the state of a previously established syndrome (step 1815). If such a transition can be performed, then the fault handler performs the transition (step 1820) and concludes state processing for the error (step 1825). If the error cannot transition any existing syndrome, the fault handler creates a new syndrome at an INITIAL state (step 1830) and determines whether the error can transition the syndrome to a starting state of any syndrome (step 1835). If so, the fault handler performs the transition (step 1820) and concludes state processing for the error (step 1825). If the error is not anticipated by a starting state of a syndrome, the fault handler converts the error to an unsyndromed error (step 1840) and concludes state processing for the error (step 1845). An unsyndromed error is a catch-all for errors that are unanticipated, misreported, or incompletely transmitted or received. For example, if canonical errors are in the triplet form described above, then a canonical unsyndromed error might consist of an identifier for "UNSYNDROMED", an error target corresponding to the reporting source of the bogus error, and a reporting source of the component containing the fault handler.

A set of potential callouts is associated with each state in the state table. When a syndrome transitions to a particular state, the set of callouts associated with that state is referred to as the callout list of the syndrome.

The state table is a tree-structured list of canonical errors. The tag on each node of the tree is an identifier for a canonical error. Each node points to a list of other errors, or to a list of callouts, or to both.
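The dispatch loop of procedure 1800 can be sketched as follows, assuming a hypothetical syndrome object with a can_transition/transition interface; only the ordering of steps 1805 through 1845 comes from the text.

    # Sketch of fault-handler error processing (Fig. 18): canonicalize
    # the error, try to advance an existing syndrome, else try to start
    # a new syndrome, else record an unsyndromed error.

    from collections import namedtuple

    # Canonical error triplet described in the text (step 1805).
    CanonicalError = namedtuple("CanonicalError",
                                ["error_id", "target", "reporting_source"])

    def process_error(error, syndromes, new_syndrome, unsyndromed):
        # Steps 1810-1820: transition a previously established syndrome.
        for syndrome in syndromes:
            if syndrome.can_transition(error):
                syndrome.transition(error)
                return "transitioned existing syndrome"    # step 1825
        # Steps 1830-1835: create a syndrome at INITIAL and see whether
        # the error matches the starting state of any known syndrome.
        candidate = new_syndrome()
        if candidate.can_transition(error):
            candidate.transition(error)
            syndromes.append(candidate)
            return "started new syndrome"
        # Step 1840: catch-all for unanticipated or garbled errors.
        unsyndromed.append(CanonicalError("UNSYNDROMED",
                                          error.reporting_source,
                                          "local fault handler"))
        return "recorded unsyndromed error"

    class Syndrome:
        # Trivial stub: accepts only a NAK error as its first transition.
        def __init__(self):
            self.errors = []
        def can_transition(self, e):
            return e.error_id == "NAK"
        def transition(self, e):
            self.errors.append(e)

    print(process_error(CanonicalError("NAK", "path M1-M2", "M1"),
                        [], Syndrome, []))   # 'started new syndrome'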
The state table can be created from a source document that consists of groups of errors (syndromes). The syndrome text syntactically indicates whether one error should occur before another in a syndrome. If no such indication is provided, then the errors are assumed to be unordered. A list of callouts is associated with each syndrome.

The state table is created by permutating each syndrome's errors, converting each error to canonical form, and mapping the syndrome into the table. The terminal node of each permutation points to the callout list for the syndrome. For example, suppose the source table contains the syndromes with associated callouts illustrated in Fig. 19A, and the system topology is assumed to be that of Fig. 17. The "Error" column of Fig. 19A identifies different errors reported to the fault handler in a SYSERR message. In particular, a NAK error indicates excessive retries on a transmit path, an EDC error indicates a low-level protocol failure on a receive path, and a NAK (echoed) error is a NAK error transmitted by one component and echoed by another. The "Path" column identifies the path on which an error was detected, and the "Received from" column identifies the machine that reported the SYSERR (M1 indicates IOP 1705 (machine 1) and M2 indicates CE 1720 (machine 2)). Finally, the "callouts" column provides a list of possible failures in the system that could cause the collection of errors listed in the "Error" column.

Using the source table of Fig. 19A, for syndrome #1 to be detected by the fault handler, a NAK error reported against path M1-M2 by M1, a NAK error reported against path M1-M2 by M1 echoed from M2, and an EDC error reported against path M1-M2 by M2 must all occur. Then the resulting callout is all of the entries in the "callout" column for syndrome #1.

The state transition table resulting from the above source would then appear as illustrated in Fig. 19B, where the canonical form error designations are in the triplet form: error(error_path, received_from). The above state transition table is really a tree structure. The root is the "Initial state". Starting from the initial state there are six ways of arriving at a syndrome #1 callout. There are three required error messages to arrive at a syndrome #1 callout and they can arrive in any possible order (3 factorial = 6 possible orderings). Syndrome #2 callouts require two errors in any order (2 factorial = 2 possible orderings). Thus a NAK (M1-M2, M1 v. M2) and an EDC (M1-M2, M2) arriving in any order without the third error NAK (M1-M2, M1) produces a syndrome #2 callout. In this case, the existence of the NAK (M1-M2, M1) uniquely identifies syndrome #1.

The state table is constructed by permutating sets of errors and, therefore, can become very large. This is particularly true if the syndromes comprising the table are particularly complex or if there are a large number of system components generating errors. The table size may be reduced by logical to physical mapping. In a system with redundant components, errors reported against one component are indistinguishable (up to unit identification) from errors reported against the redundant set that includes that component. Accordingly, the table size can be reduced by identifying errors and callouts in logical form so that the table is, in effect, reflective of errors corresponding to only one component of a redundant set.
With this approach, a mapping of physical to logical identifiers is maintained for each syndrome as the syndrome is constructed. When a canonical error is compared against the errors in a state of a syndrome, the error must be further transformed into a logical canonical form relative to that syndrome. The logical to physical mapping may vary from syndrome to syndrome.

The size of the state table may also be reduced by subtree folding. Many parts of the lower structure of the tree-structured state table are identical to other parts. Identical subtrees can be collapsed into a single copy, even if they originate from different syndromes. Performing this optimization tends to mitigate the explosive growth of the state table as syndrome complexity increases: larger syndromes generate more duplicate subtrees that can be eliminated by subtree folding.

Each syndrome indicates zero, one, or more potential faulty components, or callouts, in the system. The fault handler produces a single diagnosis by combining these indications into an event callout list. Ideally, the callout list includes only a single element that unambiguously identifies a system component or function that failed and precipitated the error event. The callout list may, however, contain more than one element.

The final callout list is formed by taking the intersection of the most-likely callouts associated with each syndrome. Some of the faults indicated by a syndrome are more likely than others. For example, a transmission error may result from a transient point-to-point error, a common transmitter or common receiver fault, cable integrity problems, or incipient power failure on one end. Incipient power failure could explain almost any erroneous behavior. However, without corroborating evidence, a transient fault is a more likely explanation for a single transmission failure than is an incipient power failure. In addition, some faults identify components less specifically and can be presumed to identify more specific components subsumed therein. For example, the callout for one syndrome may specify an interconnect failure between IOP 1705 and CE 1720. This callout would include, for example, the IPI adapters 1715 at both ends of the interconnection as well as the cabling 1725 between them. Another syndrome may indicate a non-specific error on CE 1720 (that is, all hardware and software components of CE 1720, including the IPI adapter 1715 of CE 1720). The combination of the callout lists of these syndromes clearly indicates that the IPI adapter 1715 of CE 1720 is faulty.

Two callouts are combined by selecting a system component that is common to the two callouts, but least encompassing. For any two callouts, there is at least one other callout that contains an intersection of the system components covered by the first two. The product of the combination of callouts is such a component that covers the least part of the system.
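This "least-encompassing common component" rule behaves like a lowest-common-ancestor lookup over the callout hierarchy. Below is a minimal sketch assuming a hypothetical hierarchy expressed as a child-to-parent map; the component names echo the Fig. 17 example but are otherwise illustrative.

    # Sketch of combining two callouts: walk up a callout hierarchy and
    # return the lowest component common to both, except that a callout
    # combined with itself is itself.

    PARENT = {
        "interconnect IOP1705-CE1720": "IPI adapter (CE 1720)",
        "interconnect IOP1710-CE1720": "IPI adapter (CE 1720)",
        "IPI adapter (CE 1720)": "CE 1720",
        "CE 1720": "system",
    }

    def ancestors(callout):
        chain = [callout]
        while callout in PARENT:
            callout = PARENT[callout]
            chain.append(callout)
        return chain

    def combine(a, b):
        if a == b:
            return a                      # identity rule from the text
        chain_b = ancestors(b)
        for c in ancestors(a):            # lowest common entry wins
            if c in chain_b:
                return c
        return "system"

    print(combine("interconnect IOP1705-CE1720",
                  "interconnect IOP1710-CE1720"))
    # IPI adapter (CE 1720)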
For example, suppose that in the loosely-connected system shown in Fig. 17 there is a callout against the interconnection of IOP 1705 and CE 1720 and also against the interconnection of IOP 1710 and CE 1720. These interconnections intersect at the IPI adapter 1715 in CE 1720. They also could be considered to intersect at CE 1720, in general. However, the adapter is the callout chosen since this is the least-encompassing component that can be identified as the intersection of the faulty interconnections.

The approach described above for combining callouts results in there being only one product callout for each combination of two other callouts. Accordingly, a "multiplication table" could be established for use by the fault handler in quickly establishing a new callout from any two others. The multiplication table is formed by creating a callout hierarchy diagram such as is illustrated in Fig. 20. Each box in the diagram represents a callout, and the product of two callouts is defined as the lowest callout (in the hierarchy) common between the two. The exception to this rule is that a callout combined with itself is itself (i.e., all entries are returned instead of the lowest entry). Thus, for example:

    CXUux * CXUvx = RX(CEx),
    RX(CEx) * TX(CEx) = IPI(CEx),
    RX(IOPu) * Capability(IOPu) = Power(IOPu), and
    Device(IOPu) * Capability(CEx) = Tuple Power,

while

    CXUux * CXUux = CXUux.

NOTE: In the discussions that follow, the abbreviations being used are:

    CNux - unidirectional interconnection between Machine u and Machine x
    CBux - bidirectional interconnection between Machine u and Machine x

The multiplication table formed by this method can be very large, especially in systems with large numbers of redundant components. Fortunately, it is not necessary to actually form the multiplication table. Rather, the multiplication can be performed by following the same rules for two callouts as would be required in computing the table. All that is required is an instantiation of the directed graph comprising the callout hierarchy diagram. Furthermore, redundant callouts in the system may be eliminated by representing the callouts in logical form. (However, the logical callouts must be mapped to physical callouts when the multiplication is performed.)

The fault handler never combines callouts within a list. Instead, the fault handler combines two callout lists by "cross-multiplying" the lists to form a third list, where the third list contains all unique combinations of the original lists. For example, cross-multiplication of the list 1 and list 2 elements of Fig. 21A would result in the product elements illustrated in Fig. 21B. Eliminating duplicate callouts results in a product list of Common Machine 1/2 Power Supply; Machine 1 Power; Machine 2 Power; Machine 1 Adapter; Machine 2 Adapter; and Machine 1 - Machine 2 Bidirectional Interconnect.

Callout lists from independent sources can be combined in this same manner. For example, suppose that a fault handler 1700 exists on each of IOP 1705 and 1710 in the system shown in Fig. 17. If communication exists between the two fault handlers, then the callout lists from the two fault handlers can be combined into a callout list for the entire system. This combined list will have equal or greater specificity than the callout list from each fault handler taken independently.
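A minimal sketch of that cross-multiplication follows; the combine function here is a stand-in stub (a real one would follow the hierarchy rules sketched earlier), and every name is hypothetical.

    # Sketch of cross-multiplying two callout lists: the product list
    # holds the unique pairwise combinations of the original lists.

    def cross_multiply(list1, list2, combine):
        product = []
        for a in list1:
            for b in list2:
                c = combine(a, b)
                if c not in product:      # eliminate duplicate callouts
                    product.append(c)
        return product

    # Stub combine for illustration: identical callouts combine to
    # themselves; otherwise label the pair for later resolution.
    demo = cross_multiply(["adapter", "cable"], ["adapter", "power"],
                          lambda a, b: a if a == b else "lca(%s,%s)" % (a, b))
    print(demo)
    # ['adapter', 'lca(adapter,power)', 'lca(cable,adapter)', 'lca(cable,power)']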
After all syndrome callout lists are combined, nonsensical callouts are removed from the list. This is done by ranking each callout according to the scope of compromise to system operation that it implies. The final callout list (i.e., the diagnosis) is formed by selecting callouts of only the lowest rank (lower rank is arbitrarily chosen to apply to callouts of lesser scope). For example, with the callout ranking illustrated in Fig. 22, the callout list of Fig. 21B can be reduced to a single callout: Machine 1 - Machine 2 Bidirectional Interconnect.

The ranking procedure involves a probabilistic assumption. When all else is equal, callouts of lower rank are more probably the cause of a fault event than are callouts of a higher rank. In the above example, it is possible that an incipient power failure in the common power supply between IOP 1705 (machine 1) and CE 1720 (machine 2) evoked the syndromes, but without corroborating evidence (in the form of another syndrome, or another error that would modify one of the existing syndromes) it would be misleading to report this as a callout.

Callouts represent the finest granularity of diagnosis that the fault handler is capable of producing. Generally, callouts are too specific to be of use to service personnel. However, they can be mapped to Field Replaceable Units (or FRUs) that represent the subcomponents of the system that can be identified for service or replacement. For instance, the fault handler may be capable of identifying a common receive port on an interconnect adapter. The FRU corresponding to this callout would be the adapter card.

In summary, a Fault Handler 1165 identifies a faulty FRU according to the procedure 2300 illustrated in Fig. 23. Starting with the callout list from a first syndrome (step 2305), the Fault Handler 1165 maps the logical callout to a physical list (step 2310). The Fault Handler 1165 then identifies this physical list with the designation CURRENT (step 2315).

If there are more syndromes to process (step 2320), the Fault Handler 1165 converts the logical callout list for the new syndrome to physical callouts (step 2325), cross-multiplies the list against CURRENT (step 2330) and stores the resulting product as CURRENT (step 2335).

Once all syndromes have been processed (step 2320), the Fault Handler 1165 reduces CURRENT to the most specific callout or callouts (step 2340), maps the callout or callouts to one or more FRUs (step 2345) and concludes the procedure (step 2350).

Other embodiments are within the scope of the following claims.

Claims (49)

What is claimed is:
1. A method of synchronizing data transfer to a computing element in a computer system, the computer system including the computing element and controllers that provide data from data sources to the computing element, said method comprising the steps of:
intercepting a request for data made by a computing element;
transmitting the intercepted request to the controllers; and responding, through the controllers, to the intercepted request, wherein at least a first controller responds by transmitting requested data to the computing element and by indicating how a second controller will respond to the intercepted request.
2. The method of claim 1, further comprising the step of responding, through the second controller, to the intercepted request by indicating that the second controller has no data corresponding to the intercepted request and by indicating that the first controller will respond to the intercepted request by transmitting data to the computing element.
3. The method of claim 1, further comprising the step of including, in each response to the intercepted request by a controller, an indication as to how each other controller will respond to the intercepted request.
4. The method of claim 1, further comprising the step of comparing the responses to the intercepted request for consistency.
5. The method of claim 4, further comprising the step of including in each response an indication as to how each other controller will respond to the intercepted request, wherein the step of comparing comprises comparing the indications for consistency.
6. The method of claim 5, wherein, when responses of multiple controllers include requested data, the step of comparing further comprises comparing the data for consistency.
7. The method of claim 4, further comprising the steps of:
comparing, through the computing element, the responses for consistency after the computing element has received responses from all of the controllers, and notifying the controllers, through the computing element, of the outcome of the comparison and that responses have been received from all of the controllers.
8. The method of claim 4, further comprising the step of disabling one of the controllers when the responses are not consistent.
9. The method of claim 1, further comprising the step of generating an error condition if the computing element does not receive responses from all of the controllers within a predetermined time period.
10. The method of claim 1, wherein a data source is associated with the first controller, the method further comprising the step of obtaining, in response to the intercepted request and through the first controller, the requested data from the data source.
11. The method of claim 1, further comprising the steps of:
maintaining, through the first controller, a record of a status of the second controller, and using the record when indicating how the second controller will respond to the intercepted request.
12. The method of claim 11, wherein a data source is associated with the second controller, and further comprising the steps of:
maintaining, through the first controller, a record of a status of the data source, and using the record when indicating how the second controller will respond to the intercepted request.
13. The method of claim 11, further comprising maintaining, through each controller, records of statuses of all other controllers and having controllers use the records when indicating how the other controllers will respond to the intercepted request.
14. The method of claim 13, further comprising:
associating each controller with a data source;
maintaining, through each controller, records of statuses of data sources associated with all other controllers; and having controllers use the records when indicating how the other controllers will respond to the intercepted request.
15. The method of claim 12, further comprising, when a status of the data source associated with the second controller changes, transmitting, through the second controller, an instruction to the computing element to discard responses to the intercepted request from other controllers.
16. The method of claim 15, further comprising responding, through the computing element, to the instruction by discarding responses from other controllers to the intercepted request and by transmitting to the controllers a notification that the responses have been discarded.
17. The method of claim 16, further comprising responding, through the first controller, to the notification by updating the record of the status of the data source associated with the second controller.
18. The method of claim 17, further comprising, after updating the record, retransmitting, through the first controller, the requested data to the computing element and indicating, through the first controller, how the second controller will respond to the intercepted request.
19. The method of claim 1, wherein a data source is associated with each controller and each of the controllers responds to the intercepted request by:
determining whether an associated data source is expected to process the request;
when the associated data source is expected to process the request, transmitting the request to the associated data source, receiving results of the request from the associated data source, and forwarding the results of the request to the computing element; and when the associated data source is not expected to process the request, informing the computing element that no data will be provided in response to the request.
20. A method of synchronizing data transfer to computing elements in a computer system, the computer system including the computing elements, data sources, and controllers that provide data from the data sources to the computing elements, said method comprising the steps of:
intercepting a request for data made by a computing element;
transmitting the intercepted request to the controllers; and responding to the intercepted request, through each of the controllers, by:
determining whether an associated data source is expected to process the request;
when the associated data source is expected to process the request, transmitting the request to the associated data source, receiving results of the request from the associated data source, and forwarding the results of the request to the computing element; and when the associated data source is not expected to process the request, informing the computing element that no data will be provided in response to the request.
21. A method of maintaining synchronization between computing elements processing identical instruction streams in a computer system, the computer system including the computing elements and controllers that provide data from data sources to the computing elements, wherein the controllers operate asynchronously to the computing elements, said method comprising the steps of:
at computing elements processing identical instruction streams, each computing element: stopping processing of the instruction stream at a common point in the instruction stream, generating a freeze request message, and transmitting the freeze request message to controllers;
at a controller: receiving a freeze request message from a computing element, waiting for a freeze request message from other computing elements, and, upon receiving a freeze request message from each computing element processing an identical instruction stream, generating a freeze response message and transmitting the freeze response message to the computing elements; and at the computing elements processing identical instruction streams, each computing element: upon receiving a freeze response message from a controller, waiting for freeze response messages from other controllers to which a freeze request message was transmitted, and, upon receiving a freeze response message from each said controller, generating a freeze release message, transmitting the freeze release message to controllers, and resuming processing of the instruction stream.
22. The method of claim 21, wherein the common point in the instruction stream corresponds to an I/O operation.
23. The method of claim 21, wherein the common point in the instruction stream corresponds to occurrence of a predetermined number of instructions without an I/O operation.
24. The method of claim 21, wherein the step of generating a freeze response message comprises including a time update in the freeze response message, the method further comprising having a computing element, upon receiving a freeze response message from each controller to which a freeze request message was transmitted, update a system time using the time update from a freeze response message.
25. The method of claim 24, wherein the step of updating a system time comprises using the time update from a freeze response message generated by a particular controller.
26. The method of claim 21, further comprising having a computing element, upon receiving a freeze response message from each controller to which a freeze request message was transmitted, process data received from a controller prior to receipt of a freeze response message.
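For illustration only (not part of the claims): claims 21 through 26 describe a freeze handshake that keeps asynchronously served computing elements aligned: each computing element stops at a common point and sends a freeze request; each controller answers only after hearing from every computing element, carrying a time update; each computing element resumes after hearing from every controller. The single-threaded Python sketch below simulates that message flow synchronously (the claimed controllers run asynchronously), and all names are assumptions:

```python
# Illustrative sketch only -- not part of the patent text. Simulates the
# freeze request/response exchange of claims 21-26 in a single thread.
import itertools

class Controller:
    def __init__(self, name, num_ces, clock):
        self.name, self.num_ces, self.clock = name, num_ces, clock
        self.pending = set()

    def on_freeze_request(self, ce):
        # Wait until a freeze request has arrived from every computing element.
        self.pending.add(ce)
        if len(self.pending) == self.num_ces:
            update = self.clock()   # one time update shared by all CEs (claim 24)
            for waiting in list(self.pending):
                waiting.on_freeze_response(self, time_update=update)
            self.pending.clear()

class ComputingElement:
    def __init__(self, name, controllers, time_source):
        self.name = name
        self.controllers = controllers
        self.time_source = time_source   # designated controller whose time is used (claim 25)
        self.responses = {}
        self.system_time = None

    def reach_common_point(self):
        # Stop at the common point and transmit a freeze request to each controller.
        for controller in self.controllers:
            controller.on_freeze_request(self)

    def on_freeze_response(self, controller, time_update):
        self.responses[controller] = time_update
        if len(self.responses) == len(self.controllers):
            # All controllers answered: adopt the designated controller's time;
            # a freeze release would then be sent and processing resumed.
            self.system_time = self.responses[self.time_source]
            self.responses = {}

tick = itertools.count(100)
ctrl_a = Controller("A", num_ces=2, clock=lambda: next(tick))
ctrl_b = Controller("B", num_ces=2, clock=lambda: next(tick))
ces = [ComputingElement(n, [ctrl_a, ctrl_b], time_source=ctrl_a)
       for n in ("CE1", "CE2")]
for ce in ces:
    ce.reach_common_point()
assert ces[0].system_time == ces[1].system_time == 100
```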
27. A method of handling faults in a computer system, the computer system including error reporting elements and error processing elements, the method comprising:
detecting, through an error reporting element, an error condition and transmitting information about the error condition as an error message to error processing elements connected to the error reporting element, and retransmitting the error message, through at least one error processing element, to other error processing elements connected to the at least one error processing element.
28. A method of handling faults in a computer system, the computer system including error reporting elements and error processing elements, the method comprising:
detecting, through error reporting elements, an error condition and transmitting information about the error condition as error messages to error processing elements connected to the error reporting elements, and combining, through at least one error processing element, information from related error messages from multiple error reporting elements and using the combined information in identifying a source of the error condition.
29. The method of claim 28, wherein the at least one error processing element uses a state table to combine information from related error messages.
30. The method of claim 29, wherein the at least one error processing element represents an error message using an error identifier that identifies a particular error, an error target that identifies a subcomponent that caused the error represented by the error message, and a reporting source that identifies an error reporting element that generated the error message and a path over which the error message was received.
31. The method of claim 29, wherein the error processing element determines whether error messages are related by comparing a received error message against states representing previously received error messages.
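For illustration only (not part of the claims): claims 27 through 31 describe error processing elements that both relay error messages onward and fold related reports into a state table keyed by the fields of claim 30. In the Python sketch below, the rule that two independent reporting sources localize the fault is an illustrative assumption, as are all names:

```python
# Illustrative sketch only -- not part of the patent text. Models the
# relay (claim 27) and state-table combination (claims 28-31) of error
# messages with the fields named in claim 30.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class ErrorMessage:
    error_id: str          # identifies the particular error, e.g. "CRC"
    target: str            # subcomponent suspected of causing the error
    reporting_source: str  # reporting element plus the path the message took

class ErrorProcessingElement:
    def __init__(self, downstream=()):
        self.downstream = list(downstream)   # connected processing elements
        # State table (claims 29 and 31): one entry per (error_id, target),
        # accumulating the distinct reporting sources implicating that target.
        self.state_table = defaultdict(set)

    def receive(self, msg: ErrorMessage):
        # Claim 27: retransmit the message to connected processing elements.
        for peer in self.downstream:
            peer.receive(msg)
        # Claims 28 and 31: relate this message to earlier ones via the table.
        key = (msg.error_id, msg.target)
        self.state_table[key].add(msg.reporting_source)
        if len(self.state_table[key]) >= 2:
            # Illustrative rule: two independent reports localize the fault.
            return f"fault localized to {msg.target} ({msg.error_id})"
        return None   # not enough independent evidence yet

leaf = ErrorProcessingElement()
root = ErrorProcessingElement(downstream=[leaf])
root.receive(ErrorMessage("CRC", "link-0", "controller-A/path-1"))
print(root.receive(ErrorMessage("CRC", "link-0", "ce-1/path-2")))
# -> fault localized to link-0 (CRC)
```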
32. A computer system including:
a computing element, data sources, and controllers that provide data from the data sources to the computing element, wherein:
the computing element is configured to intercept a request for data made by software running on the computing element and to transmit the intercepted request to the controllers; and at least a first controller is configured to respond to the intercepted request by transmitting requested data to the computing element and by indicating how a second controller will respond to the intercepted request.
33. The system of claim 32, wherein the second controller is configured to respond to the intercepted request by indicating that the second controller has no data corresponding to the intercepted request and by indicating that the first controller will respond to the intercepted request by transmitting data to the computing element.
34. The system of claim 32, wherein each controller is configured to include, in a response to the intercepted request, an indication as to how each other controller will respond to the intercepted request.
35. The system of claim 32, wherein the computing element is configured to compare the responses to the intercepted request for consistency.
36. The system of claim 35, wherein the computing element is configured to compare the responses for consistency after the computing element has received responses from all of the controllers and to notify the controllers of the outcome of the comparison and that responses have been received from all of the controllers.
37. The system of claim 32, wherein the computing element is configured to generate an error condition if the computing element does not receive responses from all of the controllers within a predetermined time period.
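For illustration only (not part of the claims): under claims 32 through 37, every controller's response both states its own action and predicts the other controllers' actions, so the computing element can cross-check the full response set once it has heard from every controller. A minimal consistency check in Python, with the dictionary-shaped responses assumed for illustration:

```python
# Illustrative sketch only -- not part of the patent text. Checks that each
# controller's prediction of its peers matches what the peers actually said
# (claim 35); the result would be reported back to the controllers (claim 36).

def consistent(responses):
    """responses: {controller: {"action": ..., "predicts": {other: action}}}"""
    for name, resp in responses.items():
        for other, predicted in resp["predicts"].items():
            # Controller `name`'s prediction about `other` must match what
            # `other` actually said it would do.
            if responses[other]["action"] != predicted:
                return False
    return True

responses = {
    "ctrl-1": {"action": "send-data", "predicts": {"ctrl-2": "no-data"}},
    "ctrl-2": {"action": "no-data",  "predicts": {"ctrl-1": "send-data"}},
}
assert consistent(responses)   # outcome then reported to the controllers
```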
38. The system of claim 32, wherein:
a first data source is associated with the first controller, and the first controller is configured to obtain the data requested in the intercepted request from the first data source.
39. The system of claim 32, wherein the first controller is configured to maintain a record of a status of the second controller and to use the record when indicating how the second controller will respond to the intercepted request.
40. The system of claim 39, wherein:
a first data source is associated with the second controller, and the first controller is configured to maintain a record of a status of the first data source and to use the record when indicating how the second controller will respond to the intercepted request.
41. The system of claim 40, wherein the second controller is configured to transmit an instruction to the computing element to discard responses from other controllers to the intercepted request when a status of the first data source changes.
42. The system of claim 41, wherein the computing element is configured to respond to the instruction by discarding responses from other controllers to the intercepted request and by transmitting to the controllers a notification that the responses have been discarded.
43. The system of claim 42, wherein the first controller is configured to respond to the notification by updating the record of the status of the first data source.
44. The system of claim 43, wherein the first controller is configured, after updating the record, to retransmit the requested data to the computing element and to indicate how the second controller will respond to the intercepted request.
45. The system of claim 32, wherein:
a data source is associated with each controller, and each controller is configured to respond to the intercepted request by:
determining whether an associated data source is expected to process the request;
when the associated data source is expected to process the request, transmitting the request to the associated data source, receiving results of the request from the associated data source, and forwarding the results of the request to the computing element; and when the associated data source is not expected to process the request, informing the computing element that no data will be provided in response to the request.
46. A computer system including:
a computing element, data sources, and controllers that provide data from the data sources to the computing element, wherein:
the computing element is configured to intercept a request for data made by software running on the computing element and to transmit the intercepted request to the controllers; and each controller is configured to respond to the intercepted request by:
determining whether an associated data source is expected to process the request;
when the associated data source is expected to process the request, transmitting the request to the associated data source, receiving results of the request from the associated data source, and forwarding the results of the request to the computing element; and when the associated data source is not expected to process the request, informing the computing element that no data will be provided in response to the request.
47. A computer system including:
computing elements configured to process identical instruction streams, data sources, and controllers configured to provide data from the data sources to the computing elements and to operate asynchronously to the computing elements, wherein:
each of the computing elements is further configured to:
stop processing of the instruction stream at a common point in the instruction stream, generate a freeze request message, and transmit the freeze request message to the controllers;
a controller is further configured to:
receive a freeze request message from a computing element, wait for a freeze request message from other computing elements, and, upon receiving a freeze request message from each computing element, generate a freeze response message and transmit the freeze response message to the computing elements; and each of the computing elements is further configured to:
upon receiving a freeze response message from a controller, wait for freeze response messages from other controllers to which a freeze request message was transmitted, and, upon receiving a freeze response message from each said controller, generate a freeze release message, transmit the freeze release message to the controllers, and resume processing of the instruction stream.
48. A computer system including:
error reporting elements, and error processing elements, wherein:
an error reporting element is configured to detect an error condition and transmit information about the error condition as an error message to error processing elements connected to the error reporting element, and at least one error processing element is configured to retransmit the error message to other error processing elements connected to the at least one error processing element.
49. A computer system including:
error reporting elements, and error processing elements, wherein:
error reporting elements are configured to detect an error condition and transmit information about the error condition as error messages to error processing elements connected to the error reporting elements, and at least one error processing element is configured to combine information from related error messages from multiple error reporting elements and use the combined information in identifying a source of the error condition.
CA002264599A 1996-09-17 1997-09-16 Fault resilient/fault tolerant computing Abandoned CA2264599A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US08/710,404 US5790397A (en) 1996-09-17 1996-09-17 Fault resilient/fault tolerant computing
US08/710,404 1996-09-17
PCT/US1997/016218 WO1998012657A1 (en) 1996-09-17 1997-09-16 Fault resilient/fault tolerant computing

Publications (1)

Publication Number Publication Date
CA2264599A1 (en) 1998-03-26

Family

ID=24853895

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002264599A Abandoned CA2264599A1 (en) 1996-09-17 1997-09-16 Fault resilient/fault tolerant computing

Country Status (7)

Country Link
US (2) US5790397A (en)
EP (1) EP1000404B1 (en)
JP (1) JP4544651B2 (en)
AU (1) AU723208B2 (en)
CA (1) CA2264599A1 (en)
DE (1) DE69739883D1 (en)
WO (1) WO1998012657A1 (en)

Families Citing this family (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5790397A (en) * 1996-09-17 1998-08-04 Marathon Technologies Corporation Fault resilient/fault tolerant computing
TW355762B (en) * 1996-12-26 1999-04-11 Toshiba Co Ltd Checkpoint rollback I/O control device and I/O control method
US5896523A (en) * 1997-06-04 1999-04-20 Marathon Technologies Corporation Loosely-coupled, synchronized execution
DE19815263C2 (en) * 1998-04-04 2002-03-28 Astrium Gmbh Device for fault-tolerant execution of programs
US6178522B1 (en) 1998-06-02 2001-01-23 Alliedsignal Inc. Method and apparatus for managing redundant computer-based systems for fault tolerant computing
US6321279B1 (en) * 1998-09-14 2001-11-20 Compaq Computer Corporation System for implementing intelligent I/O processing in a multi-processor system by redirecting I/O messages to a target central processor selected from the multi-processor system
US6351829B1 (en) * 1998-10-28 2002-02-26 Honeywell Inc System and method for distinguishing a device failure from an inter-device communication failure
US6578054B1 (en) * 1999-10-04 2003-06-10 Microsoft Corporation Method and system for supporting off-line mode of operation and synchronization using resource state information
FR2803057B1 (en) * 1999-12-22 2002-11-29 Centre Nat Etd Spatiales COMPUTER SYSTEM TOLERANT TO TRANSIENT ERRORS AND MANAGEMENT METHOD IN SUCH A SYSTEM
US6961765B2 (en) 2000-04-06 2005-11-01 Bbx Technologies, Inc. System and method for real time monitoring and control of networked computers
US6735715B1 (en) * 2000-04-13 2004-05-11 Stratus Technologies Bermuda Ltd. System and method for operating a SCSI bus with redundant SCSI adaptors
US6687851B1 (en) 2000-04-13 2004-02-03 Stratus Technologies Bermuda Ltd. Method and system for upgrading fault-tolerant systems
US6708283B1 (en) 2000-04-13 2004-03-16 Stratus Technologies, Bermuda Ltd. System and method for operating a system with redundant peripheral bus controllers
US6820213B1 (en) 2000-04-13 2004-11-16 Stratus Technologies Bermuda, Ltd. Fault-tolerant computer system with voter delay buffer
US6691225B1 (en) 2000-04-14 2004-02-10 Stratus Technologies Bermuda Ltd. Method and apparatus for deterministically booting a computer system having redundant components
US6665822B1 (en) * 2000-06-09 2003-12-16 Cisco Technology, Inc. Field availability monitoring
EP1239369A1 (en) * 2001-03-07 2002-09-11 Siemens Aktiengesellschaft Fault-tolerant computer system and method for its use
US6928583B2 (en) * 2001-04-11 2005-08-09 Stratus Technologies Bermuda Ltd. Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep
DE60112405T2 (en) * 2001-05-10 2006-07-06 Hewlett-Packard Development Co., L.P., Houston Computer having a main processing subsystem containing a boot sequence
US7177267B2 (en) * 2001-11-09 2007-02-13 Adc Dsl Systems, Inc. Hardware monitoring and configuration management
US7203865B2 (en) * 2002-04-23 2007-04-10 Gateway Inc. Application level and BIOS level disaster recovery
US20030236826A1 (en) * 2002-06-24 2003-12-25 Nayeem Islam System and method for making mobile applications fault tolerant
US7266607B2 (en) * 2002-08-27 2007-09-04 International Business Machines Corporation Quasi-high availability hosted applications
US7197664B2 (en) * 2002-10-28 2007-03-27 Intel Corporation Stateless redundancy in a network device
US7149919B2 (en) * 2003-05-15 2006-12-12 Hewlett-Packard Development Company, L.P. Disaster recovery system with cascaded resynchronization
DE202004021921U1 (en) * 2003-09-19 2012-11-06 Research In Motion Ltd. A handheld electronic device for providing time data in a news environment
US8799706B2 (en) 2004-03-30 2014-08-05 Hewlett-Packard Development Company, L.P. Method and system of exchanging information between processors
US7426656B2 (en) * 2004-03-30 2008-09-16 Hewlett-Packard Development Company, L.P. Method and system executing user programs on non-deterministic processors
US20050240806A1 (en) * 2004-03-30 2005-10-27 Hewlett-Packard Development Company, L.P. Diagnostic memory dump method in a redundant processor
US7272745B2 (en) * 2004-06-30 2007-09-18 Intel Corporation Data protection system
JP4165499B2 (en) * 2004-12-13 2008-10-15 日本電気株式会社 Computer system, fault tolerant system using the same, and operation control method thereof
US7418604B2 (en) * 2004-12-22 2008-08-26 Hewlett-Packard Development Company, L.P. System and method for powering on after verifying proper operation of a charge pump and voltage regulator
US7496787B2 (en) * 2004-12-27 2009-02-24 Stratus Technologies Bermuda Ltd. Systems and methods for checkpointing
US7469375B2 (en) * 2005-01-18 2008-12-23 International Business Machines Corporation Systems and methods for managing error dependencies
US7467327B2 (en) * 2005-01-25 2008-12-16 Hewlett-Packard Development Company, L.P. Method and system of aligning execution point of duplicate copies of a user program by exchanging information about instructions executed
US7328331B2 (en) * 2005-01-25 2008-02-05 Hewlett-Packard Development Company, L.P. Method and system of aligning execution point of duplicate copies of a user program by copying memory stores
US20060222125A1 (en) * 2005-03-31 2006-10-05 Edwards John W Jr Systems and methods for maintaining synchronicity during signal transmission
US20060222126A1 (en) * 2005-03-31 2006-10-05 Stratus Technologies Bermuda Ltd. Systems and methods for maintaining synchronicity during signal transmission
US7590885B2 (en) * 2005-04-26 2009-09-15 Hewlett-Packard Development Company, L.P. Method and system of copying memory from a source processor to a target processor by duplicating memory writes
US7933966B2 (en) * 2005-04-26 2011-04-26 Hewlett-Packard Development Company, L.P. Method and system of copying a memory area between processor elements for lock-step execution
US20070011499A1 (en) * 2005-06-07 2007-01-11 Stratus Technologies Bermuda Ltd. Methods for ensuring safe component removal
US20070006166A1 (en) * 2005-06-20 2007-01-04 Seagate Technology Llc Code coverage for an embedded processor system
US20070028144A1 (en) * 2005-07-29 2007-02-01 Stratus Technologies Bermuda Ltd. Systems and methods for checkpointing
US20070038891A1 (en) * 2005-08-12 2007-02-15 Stratus Technologies Bermuda Ltd. Hardware checkpointing system
US7669073B2 (en) * 2005-08-19 2010-02-23 Stratus Technologies Bermuda Ltd. Systems and methods for split mode operation of fault-tolerant computer systems
US7516246B2 (en) * 2005-10-27 2009-04-07 International Business Machines Corporation Communications channel method for verifying integrity of untrusted subsystem responses to a request
US20070180312A1 (en) * 2006-02-01 2007-08-02 Avaya Technology Llc Software duplication
US8121029B2 (en) * 2006-02-06 2012-02-21 At&T Intellectual Property I, L.P. Methods and systems for providing supported DSL communications features as selections
US20070192381A1 (en) * 2006-02-15 2007-08-16 Padmanabhan Arun K Recalling website customer information across multiple servers located at different sites not directly connected to each other without requiring customer registration
US20070209032A1 (en) * 2006-02-23 2007-09-06 Microsoft Corporation Driver verifier
US7676694B2 (en) * 2006-03-31 2010-03-09 Emc Corporation Managing system components
DE102006048173A1 (en) * 2006-10-10 2008-04-17 Robert Bosch Gmbh Method for switching a system with several execution units
FR2912526B1 (en) * 2007-02-13 2009-04-17 Thales Sa METHOD OF MAINTAINING SYNCHRONISM OF EXECUTION BETWEEN MULTIPLE ASYNCHRONOUS PROCESSORS WORKING IN PARALLEL REDUNDANTLY.
US7949905B2 (en) * 2007-10-09 2011-05-24 Honeywell International Inc. Apparatus and method for dynamically detecting improper configuration data provided in a network
JP5152340B2 (en) * 2008-12-01 2013-02-27 富士通株式会社 Control circuit, information processing apparatus, and information processing apparatus control method
US9424213B2 (en) * 2012-11-21 2016-08-23 Coherent Logix, Incorporated Processing system with interspersed processors DMA-FIFO
US9251002B2 (en) 2013-01-15 2016-02-02 Stratus Technologies Bermuda Ltd. System and method for writing checkpointing data
US9652338B2 (en) 2013-12-30 2017-05-16 Stratus Technologies Bermuda Ltd. Dynamic checkpointing systems and methods
WO2015102874A2 (en) 2013-12-30 2015-07-09 Stratus Technologies Bermuda Ltd. Method of delaying checkpoints by inspecting network packets
WO2015102875A1 (en) 2013-12-30 2015-07-09 Stratus Technologies Bermuda Ltd. Checkpointing systems and methods of using data forwarding
CA2967748A1 (en) 2014-11-13 2016-05-19 Virtual Software Systems, Inc. System for cross-host, multi-thread session alignment
DE102017208484A1 (en) * 2017-05-19 2018-11-22 Robert Bosch Gmbh Method and device for detecting hardware errors in microprocessors
US10592359B2 (en) 2017-11-02 2020-03-17 Cognizant Technology Solutions India Pvt. Ltd. System and a method for providing on-demand resiliency services
US10902166B2 (en) * 2017-12-31 2021-01-26 Arteris, Inc. System and method for isolating faults in a resilient system
US10838815B2 (en) 2018-09-19 2020-11-17 Dell Products L.P. Fault tolerant and diagnostic boot
US10891068B2 (en) 2019-06-03 2021-01-12 International Business Machines Corporation Temporary relocation of data within local storage of a dispersed storage network

Family Cites Families (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4123794A (en) * 1974-02-15 1978-10-31 Tokyo Shibaura Electric Co., Limited Multi-computer system
US4358823A (en) * 1977-03-25 1982-11-09 Trw, Inc. Double redundant processor
US4270168A (en) * 1978-08-31 1981-05-26 United Technologies Corporation Selective disablement in fail-operational, fail-safe multi-computer control system
US4342083A (en) * 1980-02-05 1982-07-27 The Bendix Corporation Communication system for a multiple-computer system
US4438494A (en) * 1981-08-25 1984-03-20 Intel Corporation Apparatus of fault-handling in a multiprocessing system
US4449182A (en) * 1981-10-05 1984-05-15 Digital Equipment Corporation Interface between a pair of processors, such as host and peripheral-controlling processors in data processing systems
US4503535A (en) * 1982-06-30 1985-03-05 Intel Corporation Apparatus for recovery from failures in a multiprocessing system
US4634110A (en) * 1983-07-28 1987-01-06 Harris Corporation Fault detection and redundancy management system
US4531185A (en) * 1983-08-31 1985-07-23 International Business Machines Corporation Centralized synchronization of clocks
US4823256A (en) * 1984-06-22 1989-04-18 American Telephone And Telegraph Company, At&T Bell Laboratories Reconfigurable dual processor system
US4622667A (en) * 1984-11-27 1986-11-11 Sperry Corporation Digital fail operational automatic flight control system utilizing redundant dissimilar data processing
US4695945A (en) * 1985-02-28 1987-09-22 International Business Machines Corporation Processor I/O and interrupt filters allowing a co-processor to run software unknown to the main processor
US5062042A (en) * 1986-04-28 1991-10-29 Xerox Corporation System for managing data which is accessible by file address or disk address via a disk track map
US4920481A (en) * 1986-04-28 1990-04-24 Xerox Corporation Emulation with display update trapping
US4812968A (en) * 1986-11-12 1989-03-14 International Business Machines Corp. Method for controlling processor access to input/output devices
US4816989A (en) * 1987-04-15 1989-03-28 Allied-Signal Inc. Synchronizer for a fault tolerant multiple node processing system
US4910663A (en) * 1987-07-10 1990-03-20 Tandem Computers Incorporated System for measuring program execution by replacing an executable instruction with interrupt causing instruction
US4916704A (en) * 1987-09-04 1990-04-10 Digital Equipment Corporation Interface of non-fault tolerant components to fault tolerant system
CA1320276C (en) * 1987-09-04 1993-07-13 William F. Bruckert Dual rail processors with error checking on i/o reads
DE3854026D1 (en) * 1987-09-04 1995-07-27 Digital Equipment Corp Fault-tolerant computer system with error limitation.
EP0306211A3 (en) * 1987-09-04 1990-09-26 Digital Equipment Corporation Synchronized twin computer system
US4907228A (en) * 1987-09-04 1990-03-06 Digital Equipment Corporation Dual-rail processor with error checking at single rail interfaces
AU616213B2 (en) * 1987-11-09 1991-10-24 Tandem Computers Incorporated Method and apparatus for synchronizing a plurality of processors
CA2003338A1 (en) * 1987-11-09 1990-06-09 Richard W. Cutts, Jr. Synchronization of fault-tolerant computer system having multiple processors
US4847830A (en) * 1987-12-02 1989-07-11 Network Equipment Technologies, Inc. Method and apparatus for automatic loading of a data set in a node of a communication network
DE3803525C2 (en) * 1988-02-05 1993-12-02 Licentia Gmbh Device for operating absolute real-time clocks in a process control system containing a central clock and subscribers
US4937741A (en) * 1988-04-28 1990-06-26 The Charles Stark Draper Laboratory, Inc. Synchronization of fault-tolerant parallel processing systems
US4965717A (en) * 1988-12-09 1990-10-23 Tandem Computers Incorporated Multiple processor system having shared memory with private-write capability
US5048022A (en) * 1989-08-01 1991-09-10 Digital Equipment Corporation Memory device with transfer of ECC signals on time division multiplexed bidirectional lines
US5088021A (en) * 1989-09-07 1992-02-11 Honeywell, Inc. Apparatus and method for guaranteed data store in redundant controllers of a process control system
US5091847A (en) * 1989-10-03 1992-02-25 Grumman Aerospace Corporation Fault tolerant interface station
US5295258A (en) * 1989-12-22 1994-03-15 Tandem Computers Incorporated Fault-tolerant computer system with online recovery and reintegration of redundant components
US5327553A (en) * 1989-12-22 1994-07-05 Tandem Computers Incorporated Fault-tolerant computer system with /CONFIG filesystem
US5280621A (en) * 1989-12-29 1994-01-18 Zenith Data Systems Corporation Personal computer having dedicated processors for peripheral devices interconnected to the CPU by way of a system control processor
US5161156A (en) * 1990-02-02 1992-11-03 International Business Machines Corporation Multiprocessing packet switching connection system having provision for error correction and recovery
US5095423A (en) * 1990-03-27 1992-03-10 Sun Microsystems, Inc. Locking mechanism for the prevention of race conditions
US5157667A (en) * 1990-04-30 1992-10-20 International Business Machines Corporation Methods and apparatus for performing fault isolation and failure analysis in link-connected systems
US5142470A (en) * 1990-09-26 1992-08-25 Honeywell Inc. Method of maintaining synchronization of a free-running secondary processor
US5261092A (en) * 1990-09-26 1993-11-09 Honeywell Inc. Synchronizing slave processors through eavesdrop by one on periodic sync-verify messages directed to another followed by comparison of individual status
US5226152A (en) * 1990-12-07 1993-07-06 Motorola, Inc. Functional lockstep arrangement for redundant processors
US5339404A (en) * 1991-05-28 1994-08-16 International Business Machines Corporation Asynchronous TMR processing system
US5222215A (en) * 1991-08-29 1993-06-22 International Business Machines Corporation Cpu expansive gradation of i/o interruption subclass recognition
US5367639A (en) * 1991-12-30 1994-11-22 Sun Microsystems, Inc. Method and apparatus for dynamic chaining of DMA operations without incurring race conditions
US5251312A (en) * 1991-12-30 1993-10-05 Sun Microsystems, Inc. Method and apparatus for the prevention of race conditions during dynamic chaining operations
US5398331A (en) * 1992-07-08 1995-03-14 International Business Machines Corporation Shared storage controller for dual copy shared data
US5423025A (en) * 1992-09-29 1995-06-06 Amdahl Corporation Error handling mechanism for a controller having a plurality of servers
US5448722A (en) * 1993-03-10 1995-09-05 International Business Machines Corporation Method and system for data processing system error diagnosis utilizing hierarchical blackboard diagnostic sessions
US5390326A (en) * 1993-04-30 1995-02-14 The Foxboro Company Local area network with fault detection and recovery
AU680974B2 (en) 1993-12-01 1997-08-14 Marathon Technologies Corporation Fault resilient/fault tolerant computing
US5491625A (en) * 1993-12-23 1996-02-13 The Dow Chemical Company Information display system for actively redundant computerized process control
US5555372A (en) * 1994-12-21 1996-09-10 Stratus Computer, Inc. Fault-tolerant computer system employing an improved error-broadcast mechanism
US5790397A (en) * 1996-09-17 1998-08-04 Marathon Technologies Corporation Fault resilient/fault tolerant computing
US5968189A (en) * 1997-04-08 1999-10-19 International Business Machines Corporation System of reporting errors by a hardware element of a distributed computer system
US5923840A (en) * 1997-04-08 1999-07-13 International Business Machines Corporation Method of reporting errors by a hardware element of a distributed computer system

Also Published As

Publication number Publication date
DE69739883D1 (en) 2010-07-01
EP1000404B1 (en) 2010-05-19
WO1998012657A1 (en) 1998-03-26
AU4345697A (en) 1998-04-14
JP4544651B2 (en) 2010-09-15
US6205565B1 (en) 2001-03-20
JP2001502449A (en) 2001-02-20
US5790397A (en) 1998-08-04
AU723208B2 (en) 2000-08-17
EP1000404A4 (en) 2000-11-08
EP1000404A1 (en) 2000-05-17

Similar Documents

Publication Publication Date Title
CA2264599A1 (en) Fault resilient/fault tolerant computing
AU711435B2 (en) Fault resilient/fault tolerant computing
AU2002231167B2 (en) Method of "split-brain" prevention in computer cluster systems
US7219260B1 (en) Fault tolerant system shared system resource with state machine logging
US8239518B2 (en) Method for detecting and resolving a partition condition in a cluster
US6578160B1 (en) Fault tolerant, low latency system resource with high level logging of system resource transactions and cross-server mirrored high level logging of system resource transactions
US7028218B2 (en) Redundant multi-processor and logical processor configuration for a file server
US5398331A (en) Shared storage controller for dual copy shared data
US20050283641A1 (en) Apparatus, system, and method for verified fencing of a rogue node within a cluster
US20040153710A1 (en) Multiple hierarichal/peer domain file server with domain based, cross domain cooperative fault handling mechanisms
US20060259815A1 (en) Systems and methods for ensuring high availability
GB2367921A (en) Fault tolerant shared system resource with communication passthrough used when non-detection of a beacon signal indicates a unit failure
EP2049995A2 (en) Fault tolerance and failover using active copy-cat
US7941507B1 (en) High-availability network appliances and methods
AU7167300A (en) Fault handling/fault tolerant computing
JP3679412B6 (en) Computation with fast recovery from failure / tolerance to failure
Cardoza et al. Overview of digital UNIX cluster system architecture
Fault The Auragen System 4000

Legal Events

Date Code Title Description
EEER Examination request
FZDE Discontinued

Effective date: 20031103