CA2097782A1 - Posted write disk array system - Google Patents

Posted write disk array system

Info

Publication number
CA2097782A1
CA2097782A1 (application CA002097782A)
Authority
CA
Canada
Prior art keywords
line
data
write
disk array
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002097782A
Other languages
French (fr)
Inventor
Randy D. Schneider
David L. Flower
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Compaq Computer Corp
Original Assignee
Compaq Computer Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Compaq Computer Corp filed Critical Compaq Computer Corp
Publication of CA2097782A1 publication Critical patent/CA2097782A1/en
Abandoned legal-status Critical Current


Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C 29/00 - Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C 29/70 - Masking faults in memories by using spares or by reconfiguring
    • G11C 29/88 - Masking faults in memories by using spares or by reconfiguring with partially good memories
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0804 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with main memory updating
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/08 - Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F 11/10 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F 11/1008 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices

Abstract

POSTED WRITE DISK ARRAY SYSTEM

A posting memory used in conjunction with a drive array to increase the performance of fault tolerant disk array write operations. When the posting memory flushes dirty data back to the disk array, the posting memory coalesces or gathers contiguous small write or partial stripe write requests into larger, preferably full stripe writes. This reduces the number of extra read operations necessary to update parity information.
In this manner, the actual number of reads and writes to the disk array to perform the transfer of write data to the disk array is greatly reduced. In addition, when the posting memory is full, the posting memory delays small, i.e., partial stripe writes but allows full stripe writes or greater to pass directly to the disk array. This reduces the frequency of partial stripe writes and increases disk array performance.

Description


POSTED WRITE DISK ARRAY SYSTEM

The present invention is directed toward a method and apparatus for improving the performance of a disk array system in a computer system, and more particularly to a posted write memory used in conjunction with a disk array system to increase the efficiency of the disk array system.
Personal computer systems have developed over the years and new uses are being discovered daily. The uses are varied and, as a result, have different requirements for the various subsystems forming a complete computer system. With the increased performance of computer systems, mass storage subsystems, such as fixed disk drives, play an increasingly important role in the transfer of data to and from the computer system. In the past few years, a new trend in storage subsystems, referred to as a disk array subsystem, has emerged for improving data transfer performance, capacity, and reliability.
A number of reference articles on the design of disk arrays have been published in recent years. These include "Some Design Issues of Disk Arrays" by Spencer Ng, April 1989, IEEE; "Disk Array Systems" by Wes E. Meador, April 1989, IEEE; and "A Case for Redundant Arrays of Inexpensive Disks (RAID)" by D. Patterson, G. Gibson and R. Katz, Report No. UCB/CSD 87/391, December 1987, Computer Science Division, University of California, Berkeley, California.

One reason for building a disk array subsystem is to create a logical device that has a very high data transfer rate. This may be accomplished by "ganging" multiple standard disk drives together and transferring data to or from these drives in parallel. Accordingly, data is stored "across" each of the disks comprising the disk array so that each disk holds a portion of the data comprising a data file. If n drives are ganged together, then the effective data transfer rate may be increased up to n times. This technique, known as striping, originated in the supercomputing environment where the transfer of large amounts of data to and from secondary storage is a frequent requirement.
In striping, a sequential data block is broken into segments of a unit length, such as sector size, and sequential segments are written to sequential disk drives, not to sequential locations on a single disk drive. The combination of corresponding sequential data segments across each of the n disks in an array is referred to as a stripe. The unit length or amount of data that is stored "across" each individual disk is referred to as the stripe size. The stripe size affects data transfer characteristics and access times and is generally chosen to optimize data transfers to and from the disk array. If the data block is longer than n unit lengths, the process repeats for the next stripe location on the respective disk drives. With this approach, the n physical drives become a single logical device.
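To make the striping arithmetic above concrete, the following sketch maps a logical sector number onto a drive, stripe and offset for an array of n data drives. The structure names and the sectors-per-stripe figure are illustrative assumptions, not taken from the patent; the example geometry mirrors the 3+1 layout of Figure 1.

```c
#include <stdio.h>

/* Hypothetical striping parameters: n data drives, each contributing
 * "sectors_per_stripe" sectors to every stripe. */
typedef struct {
    unsigned data_drives;        /* n data drives in the array           */
    unsigned sectors_per_stripe; /* stripe size on one drive, in sectors */
} stripe_map;

typedef struct {
    unsigned drive;   /* which physical drive holds the sector  */
    unsigned stripe;  /* which stripe (row) across the array    */
    unsigned offset;  /* sector offset within that drive stripe */
} stripe_loc;

/* Map a logical sector number to its physical location: consecutive
 * unit-length segments fill one drive's portion, move to the next drive,
 * then wrap to the next stripe. */
stripe_loc map_sector(const stripe_map *m, unsigned logical_sector)
{
    unsigned per_full_stripe = m->data_drives * m->sectors_per_stripe;
    unsigned within          = logical_sector % per_full_stripe;
    stripe_loc loc;

    loc.stripe = logical_sector / per_full_stripe;
    loc.drive  = within / m->sectors_per_stripe;
    loc.offset = within % m->sectors_per_stripe;
    return loc;
}

int main(void)
{
    stripe_map m = { 3, 4 };   /* 3 data drives, 4 sectors per drive stripe */
    for (unsigned s = 0; s < 16; s++) {
        stripe_loc l = map_sector(&m, s);
        printf("logical %2u -> drive %u, stripe %u, offset %u\n",
               s, l.drive, l.stripe, l.offset);
    }
    return 0;
}
```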
One technique that is used to provide for data protection and recovery in disk array subsystems is referred to as a parity scheme. In a parity scheme, data blocks being written to various drives within the array are used and a known EXCLUSIVE-OR (XOR) technique is used to create parity information, which is written to a reserved or parity drive within the array. The advantage of a parity scheme is that it may be used to minimize the amount of data storage dedicated to data redundancy and recovery purposes within the array. For example, Figure 1 illustrates a traditional 3+1 mapping scheme wherein three disks, disk 0, disk 1 and disk 2, are used for data storage, and one disk, disk 3, is used to store parity information. In Figure 1, each rectangle enclosing a number or the letter "P" coupled with a number corresponds to a sector, which is preferably 512 bytes. As shown in Figure 1, each complete stripe uses four sectors from each of disks 0, 1 and 2 for a total of 12 sectors of data storage per stripe. Assuming a standard sector size of 512 bytes, the stripe size of each of these disk stripes, which is defined as the amount of storage allocated to a stripe on one of the disks comprising the stripe, is 2 kbytes (512 x 4). Thus each complete stripe, which includes the total of the portion of each of the disks allocated to a stripe, can store 6 kbytes of data. Disk 3 of each of the stripes is used to store parity information.
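The XOR parity relationship used by such a scheme can be sketched as follows; this is a generic illustration with hypothetical buffer names, not code from the patent.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute the parity sector for one stripe: parity = d0 XOR d1 XOR ... XOR dn-1.
 * data[i] points to the sector-sized buffer destined for data disk i; the
 * result is written to the parity drive (disk 3 in the Figure 1 layout). */
void compute_stripe_parity(uint8_t *const data[], size_t data_disks,
                           uint8_t *parity, size_t sector_bytes)
{
    for (size_t b = 0; b < sector_bytes; b++) {
        uint8_t p = 0;
        for (size_t d = 0; d < data_disks; d++)
            p ^= data[d][b];
        parity[b] = p;
    }
}
```

Because XOR is its own inverse, the same relationship lets any single lost data sector be rebuilt by XORing the parity sector with the surviving data sectors.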
However, in addition to the advantages of parity techniques in data protection and recovery, there are a number of disadvantages to the use of parity fault tolerance techniques in disk array systems. One disadvantage is that traditional operating systems perform many small writes to the disk subsystem which are often smaller than the stripe of the disk array, referred to as partial stripe write operations. As discussed below, a large number of partial stripe write operations in conjunction with a parity redundancy scheme considerably reduces disk array performance.
Two very popular operating systems used in personal computer systems are MS-DOS (Microsoft disk operating system for use with IBM compatible personal computers) and UNIX. MS-DOS, or more simply DOS, is a single threaded software application, meaning that it can only perform one operation at a time. Therefore, when a host such as the system processor or a bus master performs a disk array write operation in a DOS system, the host is required to wait for a completion signal from the disk array before it is allowed to perform another operation, such as sending further write data to the disk array. In addition, in DOS, the quantity of data that can be transmitted to a disk array is relatively small, comprising a partial stripe write. Therefore, limitations imposed by DOS considerably reduce disk array performance because a large number of partial stripe writes are performed wherein the host must continually wait for the completion of each partial stripe write before instituting a new write operation.
UNIX also includes certain features which reduce disk array performance. More particularly, UNIX uses small data structures which represent the structure of files and directories in the free space within its file system. In the UNIX file system, this information is kept in a structure called an INODE (Index Node), which is generally two kilobytes in size. These structures are updated often, and since they are relatively small compared with typical data stripe sizes used in disk arrays, these result in a large number of partial stripe write operations.
Where a complete stripe of data is being written to the array, the parity information may be generated directly from the data being written to the drive array, and therefore no extra read of the disk stripe is required. However, as mentioned above, a problem occurs when the computer writes only a partial stripe to the disk array because the disk array controller does not have sufficient information from the data to be written to compute parity for the complete stripe. Thus, partial stripe write operations generally require that the data stored on the disk array first be read, modified by the process active on the host system to generate new parity, and written back to the same address on the data disk. This operation consists of a data disk READ, modification of the data, and a data disk WRITE to the same address. In addition to the time required to perform the actual operations, it will be appreciated that a READ operation followed by a WRITE operation to the same sector on a disk results in the loss of one disk revolution, or approximately 16.5 milliseconds for certain types of hard disk drives.
Therefore, in summary, when a large number of partial stripe write operations occur in a disk array, the performance of the disk subsystem is seriously impacted because the data or parity information currently on the disk must be read off of the disk in order to generate the new parity information. Either the remainder of the stripe that is not being written must be fetched or the existing parity information for the stripe must be read prior to the actual write of the information. This results in extra revolutions of the disk drive and causes delays in servicing the request. Accordingly, there exists a need for an improved method for performing disk array WRITE operations in a parity fault tolerant disk array in order to decrease the number of partial stripe write operations.
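One common form of the read-modify-write update implied above (reading the old data and old parity, then XORing both with the new data) can be sketched as follows; the function name is illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Read-modify-write parity update for one sector of a partial stripe write:
 * new_parity = old_parity XOR old_data XOR new_data.  The old data and old
 * parity must first be READ from the disks, which is exactly the extra work
 * (and lost revolutions) that a partial stripe write costs. */
void partial_stripe_parity_update(const uint8_t *old_data,
                                  const uint8_t *new_data,
                                  const uint8_t *old_parity,
                                  uint8_t *new_parity,
                                  size_t sector_bytes)
{
    for (size_t b = 0; b < sector_bytes; b++)
        new_parity[b] = old_parity[b] ^ old_data[b] ^ new_data[b];
    /* The caller then WRITEs the new data sector and the new parity sector. */
}
```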
One technique for improving disk array system performance in general is the use of disk caching programs. In disk caching programs, an amount of main memory is utilized as a cache for disk data. Since the cache memory is significantly faster than the disk drive, greatly improved performance results when the desired data is present in the cache. While disk caching can be readily applied to read operations, it is significantly more difficult to utilize with write operations. A technique known as write posting saves the write data in the cache and returns an operation complete indicator before the data is actually written to the disks. Then, during a less active time, the data is actually written to the disk.
Background on write posting operations in computer systems is deemed appropriate. An example of write posting occurs when a microprocessor performs a write operation to a device where the write cycle must pass through an intermediary device, such as a cache system or posting memory. The processor executes the write cycle to the intermediary device with the expectation that the intermediary device will complete the write operation to the device being accessed. If the intermediary device includes write posting capability, the intermediary device latches the address and data of the write cycle and immediately returns a ready signal to the processor, indicating that the operation has completed. If the device being accessed is currently performing other operations, then the intermediary device, i.e., the posting memory, need not interrupt the device being accessed to complete the write operation, but rather can complete the operation at a later, more convenient time. In addition, if the device being accessed has a relatively slow access time, such as a disk drive, the processor need not wait for the access to actually complete before proceeding with further operations. In this manner, the processor is not delayed by the slow access times of the device being accessed nor is it required to interrupt other operations of the device being accessed. The data that has been written to the intermediary device or posting memory that has not yet been written to the device being accessed is referred to as dirty data. Data stored in the posting memory that has already been written to the device being accessed is referred to as clean data.
Therefore, when a posting memory is used in conjunction with a disk array subsystem, when a write request is received from a host, i.e., a processor or bus master, the data is written immediately to the posting memory and the host is notified that the operation has completed. Thus, the basic principle of a posted write operation is that the host receives an indication that the requested data has been recorded or received by the device being accessed without the data actually having been received by the device. The advantage is that the data from the host can be stored in the posting memory much more quickly than it can be recorded in the device being accessed, such as the disk array, thus resulting in a relatively quick response time for the write operation as perceived by the host.
However, if the write operation is a partial stripe write, which is usually the case, then additional reads are still necessary to generate the required parity information when the data is transferred from the posting memory to the drive array. Therefore, a method and apparatus is desired to efficiently implement a posting memory in conjunction with a drive array system to reduce the percentage of partial stripe write operations as well as reduce the number of overall operations to the drive array and increase disk array performance.
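A minimal sketch of the posting behaviour just described, assuming a hypothetical slot array and function names: the write is latched, the host is acknowledged immediately, and a later pass writes the dirty data out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_BYTES 512
#define POST_SLOTS   32

/* A toy posting memory: latched writes waiting to be flushed to the disks. */
typedef struct {
    uint32_t sector;
    uint8_t  data[SECTOR_BYTES];
    bool     dirty;                  /* latched but not yet written to disk */
} posted_write;

static posted_write post_queue[POST_SLOTS];

/* Intermediary-device behaviour: latch the address and data and return
 * "ready" immediately; the slow device (the disk array) is updated later. */
bool post_write(uint32_t sector, const uint8_t *data)
{
    for (int i = 0; i < POST_SLOTS; i++) {
        if (!post_queue[i].dirty) {
            post_queue[i].sector = sector;
            memcpy(post_queue[i].data, data, SECTOR_BYTES);
            post_queue[i].dirty = true;
            return true;             /* host sees the operation as complete */
        }
    }
    return false;                    /* posting memory full: caller decides */
}

/* Called later, at a convenient time, to write one dirty entry to the disk
 * array; the entry then becomes clean data. */
void flush_one(void (*disk_write)(uint32_t sector, const uint8_t *data))
{
    for (int i = 0; i < POST_SLOTS; i++) {
        if (post_queue[i].dirty) {
            disk_write(post_queue[i].sector, post_queue[i].data);
            post_queue[i].dirty = false;
            return;
        }
    }
}
```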
Background on other data integrity methods is deemed appropriate. Other methods that are used to provide data protection and recovery are mirroring techniques for disk data and battery backup techniques for semiconductor memory. Mirroring techniques require that one disk drive be set aside for the storage of data as would normally be done, and a second equivalent disk drive is used to "mirror" or identically back up the data stored. This method insures that if the primary disk drive fails, the secondary or mirrored drive remains and can be used to recover the lost data. Battery backup techniques provide that if a power loss occurs to a memory system, the battery is enabled to maintain power for a period of time until an operator can ensure an orderly shutdown of the system.

The present invention relates to a posting memory in a disk array system that reduces the number of partial stripe write operations and increases the performance of disk array operations. The posting memory is coupled to both the disk array controller and the disk array. When a host such as a processor or bus master performs a write operation to the disk array, and the posting memory is not full, the write data is written into the posting memory, and the posting memory immediately returns a ready signal to the host. Due to seek delays and rotational latency of the disk array system, the storage of data in the posting memory is much quicker than storage in the disk array system would be. Therefore, a completion message is sent back to the host much more quickly than would occur if the data were written directly into the disk array, thus enabling the host to continue other operations.
When the posting memory writes or flushes the dirty data back to the disk array, the posting memory coalesces or gathers contiguous small write or partial stripe write requests into larger, preferably full stripe writes, thus reducing or eliminating the extra read operations necessary to update parity information. In this manner, the actual number of reads and writes to the disk array to perform the transfer of write data to the disk array is greatly reduced.
When the posting memory is full and a small write, preferably a partial stripe write, occurs to the disk array, the disk array controller delays posting or storing the write request, requiring that the write request be stored in the posting memory at a later time when the memory is not full. However, if a large write, preferably a full stripe or greater, occurs, and the posting memory is full, the disk array controller allows the write to pass directly to the disk array. A full stripe write operation is allowed to proceed because this operation will not require the preceding reads associated with a partial stripe write operation, and thus allowing this operation to proceed will not hamper disk array performance.
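The routing decision described above might look like the following sketch; the enum and parameter names are hypothetical, and the concrete size limits used by the preferred embodiment are given later in the mapper task description.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    ROUTE_POST,    /* store in the posting memory and acknowledge the host now */
    ROUTE_DIRECT,  /* bypass posting and write straight to the disk array      */
    ROUTE_DELAY    /* hold the request until posting memory space frees up     */
} write_route;

/* Decide where a host write goes, given whether the posting memory is full
 * and whether the request covers at least one full stripe. */
write_route route_write(bool posting_mem_full, size_t request_bytes,
                        size_t full_stripe_bytes)
{
    if (!posting_mem_full)
        return ROUTE_POST;                     /* normal posted write           */
    if (request_bytes >= full_stripe_bytes)
        return ROUTE_DIRECT;                   /* needs no preceding reads      */
    return ROUTE_DELAY;                        /* partial stripe: wait for room */
}
```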
Therefore, a method and apparatus for increasing the performance of a disk array system is disclosed. The posting memory allows a completion signal to be immediately returned to the host, thus enabling the host to continue operations much more quickly than if the data were actually written to the much slower drive array. In addition, partial stripe writes within the posting memory are coalesced or gathered into full stripe writes, if possible, before the writes are actually performed to the disk array, thus reducing the frequency of partial stripe writes and increasing disk array performance. If the posting memory is full, then partial stripe writes are delayed for a period of time to prevent these operations from being performed directly to the array. However, write operations that comprise a full stripe write or greater are allowed to proceed to the drive array when the posting memory is full, as these operations do not reduce system performance.

A better understanding of the present invention can be obtained when the following detailed description of the specific embodiment is considered in conjunction with the following drawings, in which:
Figure 1 is a prior art diagram of a traditional 3 + 1 disk array mapping scheme having a uniform stripe size;
Figure 2 is a block diagram of a disk array system incorporating the present invention;
Figure 3 is a block diagram of the transfer controller of Figure 2;
Figure 4 is a block diagram of the posted write RAM of Figure 2;
Figure 5 is a schematic diagram of the power control logic of Figure 4;
Figure 6 illustrates a command list generated by a host to the drive array;
Figures 7A-7C, 8A-8F, 9A-9D, 10, and 11A-11G are flow chart diagrams illustrating various software tasks operating in the disk array system of Figure 2; and
Figure 12 illustrates the various types of request lists that can be submitted to the task in Figures 8A-8F.

Referring now to Figure 2, a disk array system including a disk array controller D incorporating the present invention and a disk array A is shown. The disk array system is preferably incorporated into a host system (not shown) which transfers and receives data to and from the drive array system. The disk array controller D has a local processor 30, preferably a V53 manufactured by NEC. The local processor 30 has an address bus UA, a data bus UD and control outputs UC. The data bus UD is connected to a transceiver 32 whose output is the local data bus LD. The address bus UA is connected to the inputs of a buffer 34 whose outputs are also connected to the local data bus LD. The local processor 30 has associated with it random access memory (RAM) 36 coupled via the data bus UD and the address bus UA. The RAM 36 is connected to the processor control bus UC to develop proper timing signals. Similarly, read only memory (ROM) 38 is connected to the data bus UD, the processor address bus UA and the processor control bus UC. Thus the local processor 30 has its own resident memory to control its operation and for its data storage. A programmable array logic (PAL) device 40 is connected to the local processor control bus UC and the processor address bus UA to develop additional control signals utilized in the disk array controller D.
The local processor address bus UA, the local data bus LD and the local processor control bus UC are also connected to a bus master integrated controller (BMIC) 42. The BMIC 42 serves the function of interfacing the disk array controller D with a standard bus, such as the EISA or MCA bus, and acting as a bus master. In the preferred embodiment the BMIC 42 is interfaced with the EISA bus and is the 82355 provided by Intel. Thus by this connection with the local processor buses UA and UC and the local data bus LD, the BMIC 42 can interface with the local processor 30 to allow data and control information to be passed between the host system and the local processor 30.


Additionally, the local data bus LD and local processor control bus UC are connected to a transfer controller 44. The transfer controller 44 is generally a specialized, multichannel direct memory access (DMA) controller used to transfer data between the transfer buffer RAM 46 and the various other devices present in the disk array controller D. For example, the transfer controller 44 is connected to the BMIC 42 by the BMIC data lines BD and the BMIC control lines BC. Over this interface the transfer controller 44 can transfer data from the transfer buffer RAM 46 through the transfer controller 44 to the BMIC 42 if a read operation is requested. If a write operation is requested, data can be transferred from the BMIC 42 through the transfer controller 44 to the transfer buffer RAM 46. The transfer controller 44 can then pass this information from the transfer buffer RAM 46 to the disk array A or to a posting memory referred to as the posted write RAM 71 (PW RAM).
The transfer controller 44 includes a disk data bus DD and a disk address and control bus DAC. The disk data bus DD is connected to transceivers 48 and 50. The disk address and control bus DAC is connected to two buffers 64 and 66 which are used for control signals between the transfer controller 44 and the disk array A. The outputs of the transceiver 48 and the buffer 64 are connected to two disk drive port connectors 52 and 54. These port connectors 52 and 54 are preferably developed according to the integrated device interface utilized for hard disk units. Two hard disks 56 and 58 can be connected to each connector 52 or 54. In a similar fashion, two connectors 60 and 62 are connected to the outputs of the transceiver 50 and the buffer 66, and two hard disks can be connected to each of the connectors 60 and 62. Thus, in the preferred embodiment, 8 disk drives can be connected or coupled to the transfer controller 44 to form the disk array A. In this way, the various data, address and control signals can pass between the transfer controller 44 and the particular disk drives 56 and 58, for example, in the disk array A. In the preferred embodiment, the disk array A has a stripe size of 16 sectors, wherein each sector comprises 512 bytes of data. In addition, in one embodiment, the 8 drives in the array A are organized as two four-drive arrays, for example, two 3 + 1 mapping schemes as previously described in Figure 1. Other embodiments, such as distributed parity, etc., may also be implemented.
A programmable array logic (PAL) device block 67 is connected to the disk address and control bus DAC and receives inputs from a control latch (not shown). The PAL block 67 is used to map in the PW RAM 71 as a disk drive as indicated by the control latch and to map out an actual disk drive.
A transceiver 73 and a buffer 75 are connected between the disk data bus DD and the disk address and control bus DAC, respectively, and the PW RAM 71 to allow data and control information to be passed between the transfer controller 44 and the PW RAM 71.
In the preferred embodiment a compatibility port controller 64 is also connected to the EISA bus. The compatibility port controller 64 is connected to the transfer controller 44 over the compatibility data lines CD and the compatibility control lines CC. The compatibility port controller 64 is provided so that software which was written for previous computer systems which do not have a disk array controller D and its BMIC 42, which is addressed over an EISA specific space and allows very high throughput, can operate without requiring rewriting of the software. Thus the compatibility port controller 64 emulates the various control ports previously utilized in interfacing with hard disks.
The transfer controller 44 is itself comprised of a series of separate circuitry blocks as shown in Figure 3. There are two main units in the transfer controller 44, and these are the RAM controller 70 and the disk controller 72. The RAM controller 70 has an arbiter to control which of the various interface devices have access to the RAM 46 (Fig. 2) and a multiplexer so that data can be passed to and from the buffer RAM 46. Likewise, the disk controller 72 includes an arbiter to determine which of the various devices has access to the integrated disk interface 74 and includes multiplexing capability to allow data to be properly transferred back and forth through the integrated disk interface 74.
There are basically seven DMA channels present in the transfer controller 44. One DMA channel 76 is assigned to cooperate with the BMIC 42. A second DMA channel 78 is designed to cooperate with the compatibility port controller 64. These two devices, the BMIC 42 and the compatibility port controller 64, are coupled only to the RAM 46 through their appropriate DMA channels 76 and 78 and the RAM controller 70. The BMIC 42 and the compatibility port controller 64 do not have direct access to the integrated disk interface 74 and the disk array A. The local processor 30 is connected to the RAM controller 70 through a local processor RAM channel 80 and connected to the disk controller 72 through a local processor disk channel 82. Thus the local processor 30 connects to both the buffer RAM 46 and the disk array A as desired.
Additionally, there are four DMA disk channels 84, 86, 88 and 90. These four channels 84-90 allow information to be independently and simultaneously passed between the disk array A and the RAM 46. It is noted that the fourth DMA/disk channel 90, preferably channel 3, also includes XOR capability so that parity operations can be readily performed in the transfer controller 44 without requiring computations by the local processor 30.
The computer system and disk array subsystem described below represent the preferred embodiment of the present invention. It is also contemplated that other computer systems, not having the capabilities of the system described below, may be used to practice the present invention.
Referring now to Figure 4, a block diagram of the posted write memory 71 is shown. A cycle control block 120 receives the various signals from the buffer 75 which are provided from the DAC bus. These are the signals sufficient to determine if particular cycles, such as the read/write cycles, are occurring and to return the various error, interrupt and other signals. The cycle control 120 provides outputs to an address counter 122, various control latches 124, a parity generator/detector transceiver 126 and to data latches 128. The address counter 122 is provided to allow latching and auto incrementing capabilities to allow block operations with the transfer controller 44 to occur easily. The control latches 124 are provided to allow the local processor 30 to set various states and conditions of the posted write memory 71. The parity generator/detector transceiver 126 is used to provide the parity detection for write operations and to develop an internal data bus in the posted write memory 71 referred to as the INTDATA bus.
The address counter 122, the control latches 124, and the parity generator/detector transceiver 126 are connected to the INTDATA bus. The outputs of the address counter 122 and of the control latches 124 are provided to an address multiplexer and control block 130. The address multiplexer and control block 130 also receives outputs from the cycle control 120. The address multiplexer and control block 130 provides the output enable, write enable, row address select (RAS) and column address select (CAS) signals to a dynamic random access memory (DRAM) array 132 and provides the memory addresses to the DRAM array 132 over the MA bus. The data latches 128 provide the data to and from the DRAM array 132. The DRAM array 132 preferably is comprised of a mirrored bank of dynamic random access memories which also include sufficient capacity for parity checking. A power control block 134 is connected to a series of batteries 136 to provide battery power and to determine whether the batteries 136 or the power provided by the system is provided to the DRAM array 132.
Referring now to Figure 5, the power control block includes the positive terminal of the batteries 136 connected to the anode of a Schottky diode 141. The cathode of the diode 141 is connected to the input of a switching regulator 142. A second input of the switching regulator 142 receives a signal referred to as POWER GOOD, which indicates, when high, that the +5 volts being received by the disk controller D is satisfactory. The output of the switching regulator 142 is provided to a power input of the DRAM array 132 when the system power is not good. A +5 voltage signal and the POWER GOOD signal are connected to a switch block 156. The output of the switch block 156 is connected to the output of the switching regulator 142, which is provided to the DRAM array 132 in the PW RAM 71.
A +5 volt power supply 143 is connected through a resistor 144 to the anode of a Schottky diode 146 whose cathode is connected between the positive terminal of the batteries 136 and the anode of the diode 141. The POWER GOOD signal is connected to the gate of an N-channel enhancement MOSFET 148 whose drain is connected to the negative terminal of the batteries 136 and whose source is connected to a logical ground. The negative terminal of the batteries 136 is also connected to the drain of an N-channel enhancement MOSFET 150. The gate input of the MOSFET 150 receives a signal from a circuit referred to as battery voltage good 154, which monitors the battery voltage to determine if the batteries are fully discharged. The source of the MOSFET 150 is connected to the drain of an N-channel enhancement MOSFET 152 whose source is connected to ground. The gate input of the MOSFET 152 receives a signal referred to as BAT ON, which indicates when high that the batteries 136 are available for backup power.
When adequate power is being provided to the DRAM array 132, the +5 voltage source 143 is provided and the POWER GOOD signal is asserted, and the batteries 136 are charged from the +5 volt source 143 through the MOSFET 148. Also, when main power is operating, the switch block 156 provides the voltage source to the DRAM array 132. When main power to the DRAM array 132 is lost, the POWER GOOD signal goes low and the +5 volt source 143 disappears. If the BAT ON signal is asserted, indicating the batteries 136 are available and the battery voltage is satisfactory, a ground is provided to the negative terminal of the batteries 136 such that the batteries 136 provide power to the switching regulator 142 and out to the DRAM array 132. The negated POWER GOOD signal to the switching regulator 142 causes the switching regulator 142 to provide the proper voltage to the DRAM array 132. If the BAT ON signal is low, indicating the batteries 136 are not available or the battery voltage is too low, the DRAM array 132 is not powered when the system power is lost. This extends the life of the batteries.

The data contained in the DRAM array 132 in the PW RAM 71 is preferably organized in a manner similar to a 15-way cache and utilizes a true least recently used (LRU) replacement algorithm. However, the cache only caches on write operations and does not cache reads from the disk array A. In at least this manner, operation is different from a conventional disk cache, which primarily caches reads. When the host generates a read operation or command list to the drive array A, the controller D first checks to see if the requested data resides in the DRAM array 132 in the PW RAM 71. If so, the data is returned from the PW RAM 71, and the drive array A is not accessed. If the data does not reside in the PW RAM 71, then the data is retrieved from the drive array A. If the PW RAM 71 were operating as a true cache system, the data obtained from the drive array A would be written into the DRAM array 132 in the PW RAM 71 as well as being provided to the requesting host. However, the PW RAM 71 in the preferred embodiment does not cache or store data on read misses and therefore does not operate as a true cache system.
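A sketch of the read path just described, using toy stand-ins for the posting memory and the drive array: a hit is served from the posting memory, a miss goes to the drives and is not filled back, since the PW RAM caches writes only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_BYTES 512
#define PW_ENTRIES   8
#define DISK_SECTORS 64

/* Minimal stand-ins for the posting memory and the drive array. */
typedef struct { bool valid; uint32_t sector; uint8_t data[SECTOR_BYTES]; } pw_entry;

static pw_entry posting_mem[PW_ENTRIES];                  /* toy posting memory */
static uint8_t  disk_image[DISK_SECTORS][SECTOR_BYTES];   /* toy drive array    */

/* Read one sector on behalf of the host.  The posting memory is checked
 * first; on a miss the data comes from the drives and is NOT filled back. */
void host_read_sector(uint32_t sector, uint8_t *out)
{
    for (unsigned i = 0; i < PW_ENTRIES; i++) {
        if (posting_mem[i].valid && posting_mem[i].sector == sector) {
            memcpy(out, posting_mem[i].data, SECTOR_BYTES);        /* read hit  */
            return;
        }
    }
    memcpy(out, disk_image[sector % DISK_SECTORS], SECTOR_BYTES);  /* read miss */
    /* No allocation into posting_mem here: not a true read cache. */
}
```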


When the PW RAM 71 is not full, the PW RAM 71 stores or posts write data from the host into the DRAM array 132. The PW RAM 71 writes or flushes this data to the drive array A at a later time. The PW RAM 71 also coalesces or gathers contiguous small or partial stripe writes into full stripe writes before flushing the write data to the drive array A according to the present invention. This reduces the number of partial stripe writes, thus increasing disk array performance. When the PW RAM 71 is full, write operations smaller than a given size, preferably less than a full stripe write, are delayed from being posted until the PW RAM 71 has sufficient room available. However, write requests greater than a given size, preferably a full stripe write or greater, are allowed to proceed directly to the drive array A.
In the preferred embodiment, the disk array A may be partitioned into a plurality of logical volumes. Also in the preferred embodiment, the capability of posting writes may be enabled on a logical volume basis with a command referred to as the "set posted writes" command from the BMIC 42. When a plurality of logical volumes are established, the PW RAM 71 is preferably comprised of data areas that are available to each of the logical volumes on a first come, first served basis. In the preferred embodiment, each of the logical volumes that are configured for posted writes is required to have the same stripe size or distribution factor.
The DRAM array 132 in the PW RAM 71 includes an area where write data is stored and an area in which corresponding status information regarding the various write data is stored. The write data area includes a plurality of lines of data where each line comprises 16 sectors of data. As previously noted, 16 sectors is also preferably the stripe size in the disk array A. Therefore, one line in the PW RAM 71 holds a portion of a stripe residing on one disk in the array A. Three lines of data in the PW RAM 71 comprise an entire stripe of data on the disk array A. As previously noted, the PW RAM 71 does not store parity data.
The local processor RAM 36 stores status information regarding respective lines in the PW RAM 71. The status information stored in the RAM 36 includes a tag which stores the upper address bits of the data in the PW RAM line. As in a cache memory, the lower address bits are dictated by the location of the line in the respective way. The status information also includes a 16 bit word referred to as the dirty word, wherein each bit indicates whether the corresponding sector in the line is clean or dirty, and a word referred to as the valid word, wherein each bit indicates whether the corresponding sector in the line is valid or invalid. The status information further includes information regarding whether the line is locked and the reason for which the line is locked. A line is locked when it is currently being flushed to the disk array A or is waiting to receive write data from the host. Other types of status information in the RAM 36 associated with each line include the destination of the associated data in the line, including the logical volume, the drive, and the location in the respective stripe. The status information area in the PW RAM 71 stores a copy of certain status information held in the local processor RAM 36, including the tag, volume number, and dirty word. As discussed further below, a low priority software task continually scans the status information in the RAM 36 associated with each line, determines whether the line contains dirty data, coalesces partial stripe writes into full stripe writes, if possible, and flushes the write data to the drive array A. The operation of this task is discussed more fully below.
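The per-line status record just described (tag, dirty word, valid word, lock state and reason, and destination) might be declared as follows; the type and field names are illustrative, not taken from the patent.

```c
#include <stdint.h>

/* Why a line is currently locked (illustrative). */
typedef enum {
    LINE_UNLOCKED = 0,
    LINE_LOCKED_FLUSHING,   /* being written back to the disk array        */
    LINE_LOCKED_FILLING     /* waiting to receive write data from the host */
} line_lock;

/* Per-line status kept in the local processor RAM 36; a subset (tag, volume
 * number, dirty word) is mirrored into the PW RAM status area. */
typedef struct {
    uint32_t  tag;          /* upper address bits of the posted data            */
    uint16_t  dirty_word;   /* one bit per sector: not yet written to the array */
    uint16_t  valid_word;   /* one bit per sector: sector contents are usable   */
    line_lock lock;         /* lock state and the reason for the lock           */
    uint8_t   volume;       /* destination logical volume                       */
    uint8_t   drive;        /* destination physical drive                       */
    uint16_t  stripe_pos;   /* location within the destination stripe           */
} pw_line_status;
```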
Use of the PW RAM 71 introduces another level of catastrophic failure into the disk array system. For example, problems may arise if the primary power fails while there is dirty data within the PW RAM 71. Therefore, as previously discussed, the PW RAM 71 includes battery back-up techniques wherein batteries are available to provide power if the main power fails. Logic is also included to restore dirty data from the PW RAM 71 upon power-up after the failure of a previous system.

In addition to the battery backup techniques discussed above, the PW RAM 71 includes optional data integrity techniques such as mirroring and parity checking. Parity checking allows determination of errors prior to the actual storage of data on the drives. When an error is obtained, the mirroring feature allows access to an exact copy of the data so that valid data is still available for storage by the disk drive. The combination of battery backup, mirroring, and parity checking provides the PW RAM 71 with sufficient data security to allow use in even very critical environments.


The method of the present invention is preferably implemented as a number of application tasks running on the local processor 30 (Figure 2). Because of the nature of interactive input/output operations, it is impractical for the present invention to operate as a single batch task on the local processor 30. Accordingly, the local processor 30 utilizes a real time multitasking system which permits multiple tasks to be addressed by the local processor 30, including the present invention. Preferably, the operating system on the local processor 30 is the AMX86 Multitasking Executive by Kadak Products Limited. The AMX operating system kernel provides a number of system services in addition to the applications set forth in the method of the present invention.
In the preferred embodiment, a host such as a microprocessor or a bus master submits a command list 190 (Figure 6) to the disk array controller D through the BMIC 42. A command list may be a simple read or write request directed to the disk array A, or it may be a more elaborate set of requests containing multiple read/write or diagnostic and configuration requests. The local processor 30, on receiving this command list through the BMIC 42, parses the command list into one or more logical requests. Logical requests essentially have the same structure as the command list 190, but whereas a command list may include multiple reads or writes, each logical request includes only one read or one write. The logical request is then submitted to the local processor 30 for processing. A plurality of software tasks operating on the local processor 30 oversee the execution of the logical request, including the transferring of data. Once the execution of each of the logical requests comprising the command list is complete, the local processor 30 notifies the operating system device driver.
Referring now to Figure 6, a command list 190 comprises a command list header 191, followed by a variable number of request blocks 192. The request blocks are variable in length and may be any combination of I/O requests. The command list header 191 includes data that applies to all request blocks 192 in a given command list 190, including logical drive number, priority and control flags. The logical drive number specifies the respective logical drive destination for all request blocks 192 within the command list 190. The priority byte is used to provide control over the processing of a command list. The disk array controller D is capable of operating upon many command lists concurrently, and a specified priority permits a command list to be processed prior to those already scheduled for processing by the disk array controller D. The control flags are used for error processing and for the ordering of logical requests which have the same priority.
The individual request blocks 192 each represent an individual I/O request. By forming a command list 190 out of several individual request blocks, and submitting the command list 190 to the disk array controller D (Figure 2), host overhead is reduced. A request block 192 is comprised of two parts, a fixed length request header 193 and a variable length parameter list 194.
Each request header 193 includes a link to the next request block 192, referred to as the next request offset, the type of I/O command, space for a return status, a block or sector address, a block or sector count, and a count of scatter/gather descriptor structure elements for two S/G structures. The request header is a total of 12 bytes in length. The next request offset field is provided to allow the disk array controller D to quickly and efficiently traverse the list of variable request blocks 192. The next request offset field comprises a pointer which specifies an offset of "n" bytes from the current address to the next request block. This field makes the command list 190 a set of linked list logical requests 192. The last request block 192 has a value of 000h in the next request offset to signify the end of the command list 190.
The parameters in the parameter list 194 are created as data structures known as scatter/gather (S/G) descriptors, which define system memory data transfer addresses. The scatter/gather descriptor counters in each request header 193 are used to designate the number of scatter/gather descriptors 194 which are utilized in the particular request. The number of scatter/gather descriptors 194 associated with the request block 192 will vary. Further, if the command is a read command, the request may contain up to two different sets of scatter/gather descriptors. Each scatter/gather descriptor 194 contains a 32 bit buffer length and a 32 bit address. This information is used to determine the system memory data transfer address which will be the source or destination of the data transfer. Unlike the request blocks 192 in the command list, the scatter/gather descriptors must be contiguous and, if there exists a second scatter/gather descriptor set for a request, it must directly follow the first set of scatter/gather descriptors. The command block in the request header 193 specifies the function of the particular request block and implies the format of the parameter list.
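The command list layout described above can be expressed roughly as the following C structures. The field names, individual field widths and packing are assumptions for illustration; the patent itself only fixes the 12-byte request header size, the 32-bit scatter/gather fields and the 000h end marker.

```c
#include <stdint.h>

/* Scatter/gather descriptor: 32-bit buffer length plus 32-bit address. */
typedef struct {
    uint32_t buffer_length;
    uint32_t buffer_address;     /* host (system memory) transfer address */
} sg_descriptor;

/* Fixed-length request header (12 bytes in the patent); field widths here
 * are illustrative.  next_request_offset links to the next request block,
 * with 000h marking the last block in the command list. */
typedef struct {
    uint16_t next_request_offset;
    uint8_t  command;            /* type of I/O command               */
    uint8_t  return_status;
    uint32_t block_address;      /* starting block/sector address     */
    uint16_t block_count;        /* number of blocks/sectors          */
    uint8_t  sg_count_1;         /* descriptors in the first S/G set  */
    uint8_t  sg_count_2;         /* descriptors in the second S/G set */
} request_header;

/* Command list header: data that applies to every request block. */
typedef struct {
    uint8_t  logical_drive;      /* destination logical drive/volume        */
    uint8_t  priority;
    uint16_t control_flags;      /* error handling, same-priority ordering  */
} command_list_header;
```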


The submission of the command list and the notification of a command list completion are achieved by a protocol which uses I/O registers (not shown) in the BMIC 42. To allow multiple outstanding requests to the disk array controller D, these I/O registers utilize two channels: a command list submit channel and a command list complete channel. For a more complete description of the submission of a command list to the drive array controller, please see U.S. Patent No. 5,101,492 to Stevens et al., which is hereby incorporated by reference.

A brief overview of the manner in which a command list from the host is executed by the drive array controller D to access data in the drive array A according to the preferred embodiment of the invention is deemed appropriate. As previously mentioned, when the host generates a command list to the controller D, the local processor 30 parses the command list into one or more logical requests. A task referred to as the mapper task examines the logical requests and organizes the logical requests into a plurality of physical drive request lists for the individual drives. On write requests, the mapper task also determines if the PW RAM 71 is full and, if so, the task enables write posting for logical requests that are smaller than a given size, smaller than a full stripe write in the preferred embodiment, and disables posting for logical requests greater than a given size, which is greater than or equal to a full stripe write in the preferred embodiment. This prevents more burdensome partial stripe writes from having direct access to the disk array A while allowing less burdensome full stripe writes to be performed directly to the array A.


Once the mapper task has broken up the logical requests into a plurality of individual drive request lists, a task referred to as the scheduler task examines each request list, marking each request as a read hit, read miss, posted write, or drive array write. The scheduler task then splits up the request lists into individual drive queues for each of the drives, including the PW RAM 71. The scheduler task also "kicks off" or initiates transfer of the read requests at the head of each of the drive queues if the respective drive queue was previously empty. The scheduler task also invokes a task referred to as the transfer task (not shown) which initiates host data transfers for each of the write requests. When the host write data is received, the transfer task initiates transfer of this write data to its respective destination, either the PW RAM 71 or the drive array A.
A task referred to as the post processor task handles post processing of each request, such as status updates in the PW RAM 71. A task referred to as the dequeue task is responsible for initiating the remaining transfers in each of the drive queues after the scheduler task has initiated the request at the head of each of the queues. If a DMA channel of the transfer controller 44 is not ready for a request at the head of a drive queue, the dequeue task places this request in one of two channel queues and then examines the next request. The post processor task initiates transfers of the requests in the channel queues.
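A toy per-drive queue illustrating the scheduling rule just described: the head request is kicked off only if the queue was previously empty, and later requests are started as earlier ones complete. The names and the DMA hook are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_QUEUED 16

/* A toy per-drive request queue. */
typedef struct {
    int    requests[MAX_QUEUED];   /* opaque request handles            */
    size_t head, count;
    bool   busy;                   /* a transfer is currently in flight */
} drive_queue;

/* Hypothetical hook that hands a request to a DMA channel. */
static void start_transfer(int request) { (void)request; /* start DMA here */ }

bool enqueue_request(drive_queue *q, int request)
{
    if (q->count == MAX_QUEUED)
        return false;                           /* queue full, caller retries */

    bool was_empty = (q->count == 0) && !q->busy;

    q->requests[(q->head + q->count) % MAX_QUEUED] = request;
    q->count++;

    if (was_empty) {                 /* scheduler kicks off the head request */
        q->busy = true;
        start_transfer(q->requests[q->head]);
    }
    return true;
}

/* Called when a transfer completes: advance the queue and start the next. */
void on_transfer_done(drive_queue *q)
{
    q->head = (q->head + 1) % MAX_QUEUED;
    q->count--;
    if (q->count > 0)
        start_transfer(q->requests[q->head]);   /* remaining transfers */
    else
        q->busy = false;
}
```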
A task referred to as the flush task continually scans through the PW RAM 71 searching for dirty data to flush to the drive array A. The flush task coalesces partial stripe writes into full stripe writes, if possible, and generates logical requests similar to logical requests created by the host. Logical requests created by the flush task are processed through the mapper, scheduler, and post processor tasks in a manner similar to that of a host generated logical request. By coalescing partial stripe writes into full stripe writes, the number of actual operations to the drive array A is reduced, resulting in greater system efficiency.
In the preferred embodiment, the flush task generates a logical write request regardless of whether the task was successful in coalescing partial stripe writes into a larger write. In an alternate embodiment, the flush task does not generate a logical write request until it has been successful in coalescing partial stripe writes into a full stripe write or greater, unless a flush is required for other reasons, such as if the PW RAM 71 is full or a line needs to be replaced.
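A simplified sketch of the coalescing step: if every line belonging to a stripe is fully dirty, the flush task can issue one full stripe write that needs no preceding parity reads; otherwise the dirty sectors go out as partial stripe writes. The 3-line geometry matches the 3 + 1 example above; the function names are stand-ins for the logical requests the flush task actually builds.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINES_PER_STRIPE 3          /* 3 data drives, one PW RAM line each  */
#define FULL_LINE_DIRTY  0xFFFFu    /* all 16 sectors in the line are dirty */

typedef struct {
    bool     valid;
    uint16_t dirty;                 /* per-sector dirty bits for the line   */
} flush_line;

/* Stubs standing in for the logical write requests the flush task builds. */
static void issue_full_stripe_write(uint32_t stripe)
{
    printf("full stripe write, stripe %u (parity from new data only)\n",
           (unsigned)stripe);
}

static void issue_partial_stripe_write(uint32_t stripe, unsigned way, uint16_t dirty)
{
    printf("partial stripe write, stripe %u, line %u, dirty mask 0x%04x\n",
           (unsigned)stripe, way, dirty);
}

/* Coalesce: if every line of the stripe is fully dirty, flush it as one full
 * stripe write; otherwise flush whatever is dirty as partial stripe writes. */
void flush_stripe(const flush_line lines[LINES_PER_STRIPE], uint32_t stripe)
{
    bool full = true;
    for (unsigned i = 0; i < LINES_PER_STRIPE; i++)
        if (!lines[i].valid || lines[i].dirty != FULL_LINE_DIRTY)
            full = false;

    if (full) {
        issue_full_stripe_write(stripe);
    } else {
        for (unsigned i = 0; i < LINES_PER_STRIPE; i++)
            if (lines[i].valid && lines[i].dirty)
                issue_partial_stripe_write(stripe, i, lines[i].dirty);
    }
}
```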

Referring now to Figures 7A-7C, a flowchart illustrating operation of the mapper task in the disk array controller D is shown. The mapper task performs the following steps for each logical request, generating a plurality of request lists that are provided to the scheduler task, discussed below. In step 202, the mapper task receives a logical request generated by the local processor 30 from a command list generated by the host requesting access to the disk array A. In step 204, the mapper task determines the respective disk and the sector where the request begins. In step 206, the mapper task determines the particular fault tolerance mode being used on the logical volume being accessed, either parity fault tolerance or mirroring, and whether the parity is distributed parity, etc. The mapper stores this information in the respective drive request.
In step 208, the mapper task determines the number of headers involved in the respective request and sets certain counting variables accordingly to guarantee that the entire logical request is performed. In step 210, the mapper task determines if the request is a host request for later use. If the request is not a host request, then the request was created by the flush task to flush data from the PW RAM 71 to the drive array A. A request created by the flush task is treated differently than a host generated request in the scheduler task, which is discussed below. In step 212, the mapper task determines if the PW RAM 71 has been configured for operation. If so, then the mapper task determines if posted writes are enabled for the respective logical volume being accessed in step 214. If posted writes are enabled for the respective logical volume in step 214, then in step 216 the mapper task enables read operations from the PW RAM 71 for this logical request.
In step 218, the mapper task determines if the PW RAM 71 is full. If the PW RAM 71 is not full in step 218, then the mapper task enables write posting in step 224 for the logical request and then progresses to step 240. Thus write posting is performed for all writes when the PW RAM 71 is not full. If the PW RAM 71 is full in step 218, then in step 220 the mapper task determines if the size of the request is less than certain limits. As previously discussed, if the PW RAM 71 is full, then the controller D delays posting requests having a size below a certain predefined limit and allows requests above a certain size limit direct access to the drive array A.


In the preferred embodiment, the respective limit is set at a full stripe write or greater. By allowing a full stripe write to be immediately sent to the drive array A, relatively little delay is introduced since the full stripe write does not require prior reads or parity generation. Smaller or partial stripe writes are delayed instead of being provided to the drive array A since these requests involve preceding read operations for parity and may hamper system performance. However, if the write is greater than a full stripe, then the write will essentially include a full stripe write and a partial stripe write, and performance of the partial stripe write will require preceding reads and hence may adversely affect system performance. Despite this, only writes that are less than a full stripe write are delayed. One reason for this is that the possibility of coalescing partial stripe writes into full stripe writes is greater on individual partial stripe writes as opposed to partial stripe writes resulting from a write operation greater than a full stripe write.
The mapper task makes a distinction based on whether parity fault tolerance is being used in step 220. If parity fault tolerance is being used, then in step 220 the mapper task determines if the write is less than 24 kbytes, and in step 222 the mapper task enables write posting for requests that are smaller than 24 kilobytes. As previously mentioned, the stripe size of the array A is 32 kbytes. Assuming a 3 + 1 mapping scheme, 3 drives or 24 kbytes of data storage are available, and thus 24 kbytes constitutes a full stripe. Since write posting is not enabled for writes 24 kbytes or larger, these operations will proceed directly to the drive array A.

If parity is not being used, then the mapper task determines in step 220 if the write is less than 4 kbytes and enables write posting for requests that are less than 4 kilobytes in step 222. Write requests greater than or equal to 4 kbytes are not posted but are provided directly to the drive array A. Since parity is not being used, no reads for parity are required, and thus these operations do not hamper system performance. However, the mapper task requires writes smaller than 4 kbytes to be posted to provide the flush task with the opportunity to coalesce these various small writes into a larger logical request, thus increasing system efficiency.
For the requests in which write posting is enabled, it is noted that these requests are not performed immediately since the PW RAM 71 is full. As described below, the scheduler task operates to delay the execution of requests if the PW RAM 71 is full. After enabling write posting in step 222, or if the requests are greater than their respective limits in step 220, the mapper task then proceeds to step 240. Also, if the PW RAM 71 was not configured in step 212, then the mapper task proceeds directly to step 240, and write posting is not enabled for these requests.
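The size test of steps 218-222 can be summarized as follows, using the 24 kbyte (parity, full data stripe) and 4 kbyte (no parity) limits given above; the function and parameter names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

#define FULL_DATA_STRIPE_BYTES (24u * 1024u)  /* 3 data drives x 8 kbytes each */
#define NO_PARITY_POST_LIMIT   (4u * 1024u)   /* small writes are still posted */

/* Decide whether a write logical request has posting enabled (steps 218-224)
 * or is sent straight to the drive array.  When the PW RAM is full, a request
 * with posting enabled is delayed by the scheduler rather than run at once. */
bool enable_write_posting(bool pw_ram_full, bool parity_volume,
                          size_t request_bytes)
{
    if (!pw_ram_full)
        return true;                                    /* step 224: always post */

    if (parity_volume)                                  /* step 220, parity case */
        return request_bytes < FULL_DATA_STRIPE_BYTES;  /* post only partials    */

    return request_bytes < NO_PARITY_POST_LIMIT;        /* step 220, no parity   */
}
```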
If write posting was not enabled for the respective volume in step 214, then in step 226 the mapper task determines if the PW RAM 71 is enabled for read operations. If so, then in step 228 the mapper task enables read operations for this logical request to the PW RAM 71. The mapper task then progresses from step 228 to step 240.
In step 240, the mapper task determines if an operating system command referred to as the "write to media" command has been asserted. In the preferred embodiment, a method is provided whereby the operating system can request that write posting not be enabled for certain write commands. The operating system uses this command for certain extremely sensitive data that it desires not to be written to the PW RAM 71, but rather the operating system desires that the data be written immediately to the drive array A to ensure that this data is immediately stored on the drive array A.
If the "write to median cvmmand i~ ~et ~n ~tep 240, then in ~tep 242 the ~apper task disables write posting for the respective request. The ~apper tafik then progresses from ~tep 242 to step 244. If the ~wTite to media" command iG not set in ~tep 240, then the ~apper task progresses from ~tep 240 to 6tep 244~
In step 244, the mapper task determines the respective fault tolerance mode being used for the respective request. In step 246, the mapper task determines if the current request is a write which utilizes fault tolerance techniques such as either parity or mirroring techniques. If so, then the mapper task determines in step 248 if the write operation is to be posted to the PW RAM 71. If the current fault tolerant write operation is not to be posted to the PW RAM 71, then in step 250 the mapper task sets certain variables regarding parity or mirrored write generation for the write request. These variables are used by the mapper task later in step 252 to generate required data guard operations such as parity reads and writes if parity is being implemented, or mirrored writes if mirroring is being implemented. Data guard is a term used in this specification to generically refer to various disk data integrity or fault tolerance improvement techniques, primarily mirroring and parity.
Here it is noted that a fault tolerant or data guard write that is not being posted in step 248 may be either a data guard host write that is not being posted or a data guard flush task write, which by definition will not be posted. If the write operation is a host fault tolerant write and it is determined that the write is to be posted in step 248, then the data guard information, i.e., the necessary parity reads and parity writes or mirroring writes, are not generated now, but rather the extra data guard operations required are generated by the mapper task when the data is written back or flushed from the PW RAM 71 to the drive array A. Thus a host write that is to be posted is treated as a non-data guard write to the PW RAM 71 regardless of whether data guarding is being implemented. Therefore, if the write operation is not a fault tolerant host write in step 246, or if it is being posted to the PW RAM 71, then the respective operations required to implement parity or mirroring operations need not be generated here. If the respective request is determined not to be a fault tolerant write operation in step 246, i.e., if the write is a non-data guard write, the mapper task advances directly to step 252.
In step 252, the mapper task performs the task of breaking up the logical request into a number of physical requests or individual drive request lists. This operation involves generating a queue of individual drive request lists, i.e., generating a linked list of data structures representing requests to the individual drives. In step 254, the mapper task generates the required parity and mirroring operations for data guard write requests that are being provided directly to the drive array A, such as non-posted host writes and flush writes which require data guard requests.

Referring now to Figure 12, a request list generated by the mapper task may generally comprise either simple, i.e., non-data guard, reads from the PW RAM 71 or disk array A, as shown in Fig. 12(1), simple, i.e., non-data guard, writes, which comprise both non-data guard writes to the drive array A and all writes posted to the PW RAM 71, as shown in Fig. 12(2), parity writes to the drive array A, as shown in Fig. 12(3), and mirrored writes to the drive array A, as shown in Fig. 12(4).
As shown, simple reads in Fig. 12(1) are linked by a pointer referred to as next_ptr. Likewise, simple writes in Fig. 12(2) are linked by next_ptr. Parity writes to the drive array A include reads for parity linked by the next_ptr, data writes and associated "blocker" writes linked by next_ptr, and one parity write. The reads for parity and the data writes are linked by a pointer referred to as seq_ptr. Likewise, the data write and blocker writes are linked to the parity write by seq_ptr. The blocker writes act as placeholders in the respective drive queues to ensure the integrity of the parity information with its corresponding data. A blocker write merely reserves a place in the drive queue, and no write operation actually takes place.
Mirrored writes include both normal writes and mirrored writes. The normal writes are linked by next_ptr, as shown. The mirrored writes are linked to their corresponding data writes by seq_ptr. Other types of parity writes and mirrored writes are generated, as well as simple reads and writes, and the writes in Fig. 12 are illustrative only.
The parity write illustrated in Fig. 12(3) is one example of a parity write to a single drive in a traditional 3 + 1 mapping scheme, which was described in Figure 1. As such, the write requires two preceding reads for parity to the remaining two data drives to determine the data currently on these two drives. This data is used for parity generation. The actual data write includes a data write to one drive and two blocker writes to the remaining two unwritten data drives. Finally, the parity write is performed to the sole parity drive.
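For illustration only, one possible C representation of such a request list is sketched below; the enumerators and the fields other than next_ptr and seq_ptr are assumptions and are not taken from this specification.

    /* Hypothetical request node: requests in the same group are chained by
     * next_ptr, while seq_ptr links the last node of a group to the group
     * that must follow it (e.g., reads for parity, then data and blocker
     * writes, then the parity write). */
    enum req_kind {
        REQ_READ,           /* simple read or read for parity            */
        REQ_WRITE,          /* data write                                */
        REQ_BLOCKER,        /* placeholder in a drive queue, no transfer */
        REQ_PARITY_WRITE,   /* write of generated parity information     */
        REQ_MIRROR_WRITE    /* duplicate write to the mirrored drives    */
    };

    struct drive_request {
        enum req_kind         kind;
        int                   drive;      /* destination drive number */
        unsigned long         sector;     /* starting sector          */
        unsigned int          count;      /* number of sectors        */
        struct drive_request *next_ptr;   /* next request, same group */
        struct drive_request *seq_ptr;    /* next sequential group    */
    };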
The structure of these different types of request lists is important in understanding the operation of the scheduler task because the task initially examines the first requests in a request list linked by next_ptr and indexes through seq_ptr to examine the remaining requests, if any. Therefore, indexing using seq_ptr is not required to examine all of the requests for simple reads and writes in Fig. 12(1) and (2) because all of the requests in the request list are linked only by next_ptr. Indexing through seq_ptr is used, however, to examine the data write requests in the parity write.
If the request is a parity write, first the task examines the reads for parity connected by next_ptr. When the task learns the request list is a parity write, it indexes through the seq_ptr to the host writes and examines these requests. For reasons stated below, indexing through seq_ptr is not used to examine a parity write request in a parity write request list or mirrored write requests in a mirrored write request list. In addition, the scheduler task uses recursion to index through the various requests in a parity write or mirrored write when separating requests into their respective drive queues, as is explained below.
The scheduler task examines each of the requests in a request list and marks the requests accordingly before separating the commands into individual drive queues. This is necessary because, in some instances, the scheduler task is required to wait for lines in the PW RAM 71 to be flushed before it can initiate a request. Examples of such instances are a write to a dirty line in the PW RAM 71 which is being flushed, or a write to an area in the drive array A where dirty data corresponding to this area in the PW RAM 71 is being flushed. In these instances, the scheduler task must wait for the flush operation to complete before sending the requests to the individual drive queues.
If the task examined two consecutive reads and sent these reads to their respective drive queues, the task would be unable to delay the writes in the request list if it needed to do so because the reads would have been already sent to their respective drive queues.
Therefore, the scheduler task examines the data write requests in a parity write request list prior to separating requests into their respective drive queues.
It is noted that these same concerns do not arise in mirrored write request lists because the mirrored write requests are exact copies of the data write requests. If a data write request in a mirrored write request list was required to be delayed to allow a flush to complete, the flush would result in data written to the data portion of the drive array A as well as the mirrored portion. Thus, mirrored write requests in a mirrored write request list are not examined prior to separating the requests into individual drive queues. In addition, the parity write in a parity write request list is not examined because parity data is not stored in the PW RAM 71. Thus there is no need to examine the parity write requests since parity data cannot be involved in a flush operation.


Referring now to Figures 8A-F, the scheduler task is shown. The scheduler task analyzes each request list generated by the mapper task, determines the types of requests involved with each respective request list, and marks the requests accordingly. The scheduler then divides the requests in the list into nine individual drive queues. The nine drive queues include 8 drive queues for the drive array A and a ninth drive queue for the PW RAM 71. Once the requests in a request list have been partitioned into individual drive queues, the dequeue task and the post processor task execute the requests in these respective drive queues to actually perform the data transfers and status updates to accomplish the reads and writes.
The scheduler task determines if the respective requests involve simple host read or write requests, i.e., requests that do not require or do not yet include data guard operations, or whether the request is a write request to be provided directly to the drive array and includes data guard operations, such as a parity write or a mirrored write. The scheduler task also marks the destination of each of the requests. Read hits to the PW RAM 71 and writes that are to be posted are marked as going to the PW RAM 71. PW RAM read misses and writes that are not to be posted are marked for the drive array A.
Referring now to Figure 8A, the controller D executing the scheduler task examines individual requests in the request queue. In step 302, the scheduler task determines if the PW RAM 71 is configured for operation. If so, the scheduler task then determines in step 304 if any requests in the request list have not yet been examined. If one or more requests remain to be examined, then the scheduler task examines the next request in step 306 to determine if the request is a simple write or read request that is not related to fault tolerance. In this step, the scheduler task is only looking for simple data write or read requests, not parity writes. These requests include the simple reads and writes in Fig. 12(1) and (2), the reads for parity and data writes in Fig. 12(3), and the data writes in Fig. 12(4). Generally, the types of requests being excluded here are parity writes and mirrored writes.
If the request being examined involves a simple write or read, then in step 308 the scheduler task constructs a bit mask for the request. As previously discussed, the PW RAM 71 includes a plurality of lines storing data wherein each line corresponds to 16 sectors of a respective drive. The status information stored in the PW RAM 71 includes a plurality of 16 bit words storing a bit map representing data about each of the sectors comprising a line. As previously discussed, the bits in one 16 bit word represent whether each of the respective sectors are dirty, and the bits in another word represent whether each of the respective sectors are valid, etc. In step 308, the scheduler task constructs a bit mask for the sectors of the respective line in the PW RAM 71 which is being requested. In this manner, by using the bit mask when the status information in the PW RAM 71 is updated, only the status information bits pertaining to that portion of the line, i.e., the respective sectors being requested, need to be manipulated or changed. It is noted that the bit mask will only be used on posted writes and read hits to the PW RAM 71, but will not be used on reads or writes directly to the drive array A.
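As an illustration of step 308, a 16-bit mask covering only the requested sectors of a 16-sector line might be constructed as sketched below; the helper names and arguments are hypothetical.

    #include <stdint.h>

    #define SECTORS_PER_LINE 16u

    /* Build a mask with one bit set for each requested sector of a line;
     * first_sector is the offset of the first requested sector within the
     * line (0..15) and count is the number of requested sectors. */
    static uint16_t build_sector_mask(unsigned first_sector, unsigned count)
    {
        uint16_t mask = 0;
        unsigned i;

        for (i = 0; i < count && (first_sector + i) < SECTORS_PER_LINE; i++)
            mask |= (uint16_t)(1u << (first_sector + i));

        return mask;
    }

    /* Example status update restricted to the requested sectors: bits outside
     * the mask, i.e., the other sectors of the line, are left unchanged. */
    static void mark_sectors_dirty(uint16_t *dirty_word, uint16_t mask)
    {
        *dirty_word |= mask;
    }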
Once the bit mask has been constructed in step 308, then in step 310 the scheduler task determines if the request has a corresponding line entry in the PW RAM 71. If so, then in step 312 (Fig. 8B) the scheduler task determines if the current request is a read or write request. If the request is a read operation, then in step 314 the scheduler task determines if all of the requested sectors comprising the line entry in the PW RAM 71 are valid. If all of the sectors in the line entry are valid in step 314, then in step 316 the request is marked as a read hit and variables are set indicating that the destination of this request is the PW RAM 71. Since the request is a read hit, the request can be serviced from the PW RAM 71 without accessing the disk array A. In step 318, the respective line in the PW RAM 71 is locked to guarantee that the line is not accessed or replaced before the read operation from the PW RAM 71 completes.
If all of the requested sectors in the PW RAM line entry are determined not to be valid in step 314, then the read request is marked as a read miss in step 330, and variables are set indicating that the destination of this request is the drive array A. In step 332 the scheduler task determines if any of the requested sectors in the cache line entry are dirty. If so, then in step 334, the scheduler task waits for the line to be flushed to the drive array A. The flush of the dirty sectors is allowed to complete to guarantee that the read request to the drive array obtains the current or correct data stored in the PW RAM 71 that otherwise had not yet been written to the drive array A.
Otherwise, the read request may obtain stale or incorrect data from the drive array A. Here it can be assumed that the flush task is already running since dirty data resides in the PW RAM 71. As discussed below, the flush task runs continually, subject to being interrupted by a higher priority task, while dirty data resides in the PW RAM 71. If a flush is not running, then the scheduler task generates an error message (not shown) because this condition should never occur. Upon completion of the flush, the scheduler task advances to step 440 (Fig. 8E).
If in step 312 the request is determined to be a write operation, then in step 340 (Figure 8C), the scheduler task determines if the write request is a request from the host. As previously discussed, a write request may either be generated by the host, i.e., a processor or external device, or by the flush task. The flush task continually scans the PW RAM 71, coalesces partial stripe write requests into full stripe writes, and generates logical write requests to the mapper, similar to a host. These flush logical requests are executed in a manner similar to a host generated logical request, with some important differences. The scheduler task determines if the request is a host request because of these differences, which are discussed below.
If the write request is determined to be a host write request in step 340, then in step 342 the scheduler task determines if posting is enabled. If posting is enabled in step 342, then in step 344 the scheduler task determines if the respective line in the PW RAM 71 is currently being flushed. If the line is not being flushed in step 344, then the scheduler task advances to step 350. If the line entry is being flushed in step 344, then in step 346, the scheduler task waits until the flush completes and all dirty sectors in the line have been written to the drive array A. The scheduler task then advances to step 350.
In determining if the line is being flushed in step 344, the scheduler task is concerned with data corruption problems. If the respective line is not being flushed, then the write request simply overwrites the current data (now stale data) when it is posted to the PW RAM 71, thus causing no problems. However, if the line is currently being flushed, then the scheduler task must wait for these sectors to be flushed to the drive array A. Otherwise, if the new write request was posted or written to the PW RAM 71, and the flush of that respective line completed thereafter, the status information of the line in the PW RAM 71 would be changed to reflect that the line entry contained clean data, when in fact the write request that was just performed resulted in dirty data being stored in the line entry, thus causing data corruption problems.
In step 350, the write request is marked as a posted write, and the destination for this write request is set to the PW RAM 71 drive. In addition, the line entry in the PW RAM 71 is locked in step 352 so that this line cannot be replaced or flushed until the write operation completes. The scheduler task then advances to step 440 (Fig. 8E). It is noted that the request is marked to the PW RAM 71 in step 350 regardless of whether the PW RAM 71 is full. It was previously determined in step 310 that the request had a PW RAM line entry, and thus even if the PW RAM 71 is full, the new request can simply overwrite the old request in the respective PW RAM line.
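A simplified C sketch of this host write path (steps 342 through 352) follows; the line structure, the wait primitive, and the destination codes are hypothetical placeholders for the controller's internal representation.

    #include <stdbool.h>

    enum destination { DEST_PW_RAM, DEST_DRIVE_ARRAY };

    struct pw_line {
        bool being_flushed;   /* a flush of this line is in progress */
        bool locked;          /* line may not be replaced or flushed */
    };

    /* Hypothetical primitive: blocks until the flush of this line is done. */
    extern void wait_for_flush(struct pw_line *line);

    /* Steps 342-352 sketch for a host write that hit an existing PW RAM line. */
    static enum destination schedule_host_write(struct pw_line *line,
                                                bool posting_enabled)
    {
        if (!posting_enabled)
            return DEST_DRIVE_ARRAY;    /* handled instead by steps 360-370 */

        if (line->being_flushed)        /* step 344 */
            wait_for_flush(line);       /* step 346: avoid later marking new
                                           dirty data as clean              */

        line->locked = true;            /* step 352 */
        return DEST_PW_RAM;             /* step 350: request is posted      */
    }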
If posting is not enabled in step 342, then a write to the disk array A must be performed as opposed to a write to the PW RAM 71. In step 360, the scheduler task determines if any of the respective sectors in the corresponding PW RAM line are dirty. If so, then in step 362 the task determines if the line is currently being flushed. If so, then in step 364 the scheduler task waits for the flush operation to complete. If the line is not currently being flushed, then the task marks all of the sectors in the line as clean in step 366 to prevent a subsequent flush of these sectors. Here it is important that data marked as dirty not remain in the PW RAM 71 after completion of a write to corresponding sectors in the drive array A. If this were to occur, then a subsequent flush of the PW RAM 71 would overwrite the correct data in the drive array with stale data. Therefore, if a flush is running to this line, the flush is allowed to complete. Otherwise, the dirty sectors are marked clean to prevent a subsequent flush from occurring.
Upon completion of either steps 364 or 366, or if no sectors were dirty in step 360, then in step 368 the scheduler task marks the respective sectors in the line invalid. This prevents a host from subsequently reading stale or incorrect data from the PW RAM 71. In step 370, the scheduler task marks the destination of the respective write request to the disk array A. The task then advances to step 440 (Fig. 8E).
If in step 340 the write request was determined not to be a host request, but rather was determined to be a flush write request, then in step 380 (Fig. 8D) the scheduler task marks the write request as a PW RAM flush request, indicating that the next operation for this request involves an access to the PW RAM 71 drive. In step 382, the scheduler task saves the respective drive number where this write is intended in the drive array A. In step 384, the scheduler task changes the command in the request to a read operation in order for the local processor 30 to retrieve the data from the PW RAM 71. The scheduler task also sets a flag in step 386 if data guard operations, either parity writes or mirrored writes, are required. The read operation generated in step 384 then proceeds normally. The original write request, which enables the local processor 30 to write the data retrieved from the PW RAM 71 to the drive array A, is restored later in the post processor task, discussed below. For this reason it is necessary to save the drive number destination for the original write request in step 382. Upon completion of step 386, the task advances to step 440 (Fig. 8E).
If in step 310 (Fig. 8A) the scheduler task determines that the respective line involved in the request does not reside in the PW RAM 71, then in step 402 the scheduler task determines if the request is a write request. If the request is determined to be a write request, then in step 404 the scheduler task determines if the request is a host request. If so, then in step 406 the scheduler task determines if write posting is enabled. If posting is enabled in step 406, then in step 408 (Fig. 8E) the scheduler task marks a line entry in the PW RAM 71 for replacement using a true LRU algorithm. Here it is noted that if the PW RAM 71 is full, the task delays until a line entry is available. In step 409, the task marks the request as a posted write and sets the destination of this request to the PW RAM 71. In step 410, the respective line in the PW RAM 71 to which this write request is destined is locked to guarantee that the line is not replaced again before the posting operation is performed. The task then advances to step 440. If write posting is not enabled in step 406 (Fig. 8A), then in step 412 (Fig. 8E) the scheduler task marks the destination of the write request as the disk array A. Upon completion of step 412, the task advances to step 440.
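By way of illustration only, one conventional way of selecting a replacement line under a true LRU policy is sketched below; the timestamp field and the replaceability tests are assumptions and are not details given in this specification.

    #include <stdint.h>
    #include <stddef.h>

    #define WAYS_PER_SET 15u   /* matches the 15-way organization described later */

    struct cache_line {
        int      locked;       /* locked lines may not be replaced         */
        int      dirty;        /* dirty lines must be flushed first        */
        uint32_t last_used;    /* hypothetical access counter used for LRU */
    };

    /* Return the least recently used replaceable line in a set, or NULL if no
     * line is currently replaceable (the caller then delays, as in step 408). */
    static struct cache_line *select_lru_victim(struct cache_line *set)
    {
        struct cache_line *victim = NULL;
        size_t i;

        for (i = 0; i < WAYS_PER_SET; i++) {
            if (set[i].locked || set[i].dirty)
                continue;
            if (victim == NULL || set[i].last_used < victim->last_used)
                victim = &set[i];
        }
        return victim;
    }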
If in step 404 the write request is determined not to be a host request, but rather is determined to be a flush request, then an error signal is generated in step 414. This is deemed to be an error because a flush request, i.e., a request generated by the flush task, inherently involves a line entry which resides in the PW RAM 71. In arriving at this step, it was previously determined in step 310 that this request did not have a line entry in the PW RAM 71. Therefore, it should be impossible for this request to be determined a flush request in step 404. Therefore, this situation should never occur unless an error has occurred.
If in step 402 (Fig. 8A) the request is determined to be a read request, then this read request inherently involves a read miss as the line being read does not reside in the PW RAM 71. The scheduler task advances to step 420, marks the request as a read miss, and sets the destination of the request to the disk array A. The scheduler task then advances to step 440 (Fig. 8E).
If in step 306 the request is determined to be a write or read request involving fault tolerance, i.e., a read request for parity, a parity write, or a mirrored write, then in step 422 the scheduler task sets the destination of the request as the drive array A. The scheduler task then advances to step 440 (Fig. 8E).
Upon completion of any of steps 410, 412, 420, 422, 318, 332, 334, 352, 370, or 386, in step 440 the scheduler task determines if a flag referred to as the parity write flag is set. The parity write flag is set to indicate that the current request being examined is a write operation which includes parity generation, i.e., a parity write request list. This flag is set below in step 462 when the task learns that the request list comprises a parity write. The following sequence of steps from step 440 to step 474 allows the scheduler task to peek ahead at the data write requests in a parity write request list before any of the read for parity requests are initiated to the drive array A. As previously noted, the scheduler task must be able to wait for any flush operation to complete for data coherency reasons. These same concerns do not arise with regard to mirrored writes because mirrored writes do not include preceding reads. Also, all of the non-mirrored data writes in a mirrored write request list can be examined by indexing through the next_ptr, and the mirrored writes will be identical to the data writes. Therefore, the parity write flag is not set for mirrored writes. In addition, the parity write flag is not set for parity writes since parity data is not stored in the PW RAM 71.
If the parity write flag is set in step 440, then the request, which will be a subsequent write in a parity write request list, is marked as a parity operation in step 442, and the task advances to step 468. The write is marked as a parity operation so the write is not examined again later in step 488, discussed below, when recursion is used to send the requests to their respective drive queues.
If the parity write flag is not set in step 440, then in step 450 the task determines if more unexamined requests connected by the next_ptr remain in the request list. If so, then in step 452 the task increments to the next request and then advances to step 468. If the parity write flag is not set in step 468, which will be the case for the reads for parity in a parity write request list, and which will also be the case for simple reads and simple writes, then the task returns to step 304 to examine subsequent requests.
If the current request is the last request linked by the next_ptr, i.e., the last simple read or write request in the request list, or the last read for parity or write in a nonposted parity write, then in step 460 the task determines if the request is a read for parity.

If the request is a read for parity in step 460, then in step 462 the task indexes in the request list to the first write. At this point, the request list will include data writes and blocker writes connected by next_ptr, and a parity write connected to the above requests by seq_ptr, as previously discussed. The task sets the parity write flag for this request list and then advances to step 468. If the request is not a read for parity in step 460, then it sets a flag indicating that this is the last request in the request list. This will be the last request in the list since it was determined that there were no other requests connected by next_ptr in step 450. The task then advances to step 468.
As previously noted, in step 468 the task determines if the parity write flag is set. If so, then in step 472 the task searches through the requests for host writes, ignoring the blocker requests. In step 474, the task marks the first host write that it finds as the next request to be examined. The task then returns to step 304 to again traverse through the above steps and examine the host write. This process repeats for all of the host data writes. If no more host data writes remain in the request list, then no request is marked in step 474, and the task exits to step 482 from step 304. It is noted that for subsequent host writes, the parity write flag will be set, and thus subsequent host writes are simply marked as parity as in step 472. The parity write request is not examined.
In summary, with regard to nonposted parity writes, the read for parity requests will traverse through the above steps with step 452 operating to increment through these until the last read for parity. When the last read for parity occurs, step 462 operates to index into the host write and blocker requests, with steps 472 and 474 operating to filter out the blocker requests. In this manner, the scheduler task looks at all of the host write requests in a request list prior to separating the requests into individual drive queues, which is discussed below. This enables the scheduler task to stall write requests, if necessary, before the preceding reads are sent to the drive queues, thus allowing PW RAM lines that require flushing to be flushed.
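The peek-ahead of steps 440 through 474 may be restated by the following C sketch, which assumes the hypothetical drive_request structure introduced after the discussion of Figure 12; the traversal convention, with seq_ptr held on the last node of each group, is likewise an assumption.

    /* Visit every host data write of a parity write request list before any
     * request is sent to a drive queue, skipping blocker requests and the
     * trailing parity write (steps 460-474). */
    static void peek_at_parity_data_writes(struct drive_request *reads_for_parity,
                                           void (*examine)(struct drive_request *))
    {
        struct drive_request *r = reads_for_parity;

        /* Step 450: walk to the last read for parity in the next_ptr chain. */
        while (r != NULL && r->next_ptr != NULL)
            r = r->next_ptr;

        if (r == NULL || r->seq_ptr == NULL)
            return;                       /* not a parity write request list */

        /* Step 462: index through seq_ptr to the data and blocker writes. */
        for (r = r->seq_ptr; r != NULL; r = r->next_ptr) {
            if (r->kind == REQ_WRITE)     /* steps 472-474: ignore blockers  */
                examine(r);               /* the examination may wait here
                                             for a PW RAM line to be flushed */
        }
        /* The parity write reached through seq_ptr is deliberately not examined. */
    }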
If the PW RAM 71 was not configured in step 302 (Fig. 8A), then the task marks the destination of all of the requests to the drive array A in step 478 and advances to step 482 (Fig. 8F). When the scheduler task has finished examining and marking all requests in the request list in step 304, then the task advances to step 482. In step 482 the task returns to the first request in the request list. In step 484, the scheduler task separates the first number of requests linked by next_ptr into individual drive queues. For request lists of the first two types, simple reads or simple writes, this will constitute all of the requests. For a nonposted parity write, this involves only the reads for parity. If more requests remain in the request list in step 486, i.e., if the request list is a nonposted parity write, then the task increments through the seq_ptr to the next group of requests connected by next_ptr in step 488. The next time step 482 is executed, the next group of requests will be either the host writes in a parity write request list or the mirrored writes in a mirrored write request list. The task then returns to step 302 to again traverse through the above steps. It is noted that during subsequent passes through the scheduler task, the task determines in step 306 that the requests are not simple reads or writes, and thus a majority of the examining and marking steps are skipped. The recursion is aimed at executing step 484 once more on nonposted mirrored writes and twice more on nonposted parity writes to send the host data writes and the parity or mirrored writes to their respective individual drive queues. If the request lists contain only simple reads or simple writes, then no recursion takes place, and the task advances to step 492. Also, when the task has partitioned all of the requests into individual drive queues, the task advances to step 492.
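For illustration, the grouping of steps 482 through 488 can be sketched as below, again assuming the hypothetical drive_request structure; dispatch_to_drive_queue stands in for the unspecified queue insertion routine.

    /* Steps 482-488 sketch: send one next_ptr group to the drive queues, then
     * recurse through seq_ptr for the data writes and, finally, the parity or
     * mirrored writes of a nonposted data guard write. */
    extern void dispatch_to_drive_queue(struct drive_request *req);

    static void separate_into_drive_queues(struct drive_request *group)
    {
        struct drive_request *r;
        struct drive_request *last = NULL;

        if (group == NULL)
            return;

        for (r = group; r != NULL; r = r->next_ptr) {
            dispatch_to_drive_queue(r);          /* step 484 */
            last = r;
        }

        /* Steps 486-488: further groups exist only for data guard writes. */
        separate_into_drive_queues(last->seq_ptr);
    }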
If the request involves a write in step 492, then the task advances to step 494 and determines if the write is a data guard host write. If so, then the task marks the request with a flag indicating that the disk request cannot take place until the write data has been obtained from the host. Upon marking the request in step 496, or if the write was not a data guard host write in step 494, then in step 498 the scheduler task sends the write to a task referred to as the transfer task. The transfer task initiates the operation of transferring the host write data to the disk array controller D. The scheduler task then determines if the current request is the last request in the list in step 500. If not, then the task increments to the next request in step 508 and returns to step 492. If so, then the scheduler task completes and begins operation on a new request list.
If the request is not a write in step 492, then the task determines if the request is a read request in step 502. If not, then the request is a blocker request, and the task advances to step 500. If the request is a read in step 502, then in step 504 the task marks the read as ready to proceed when it reaches the head of the queue. In step 506, the read request is then sent to a respective channel queue to be executed to the respective drive if it is at the head of the queue. The task then advances to step 500.

Referring now to Figures 9A-9D, a flowchart diagram illustrating operation of a task which scans the lines in the PW RAM 71, coalesces partial stripe write operations into logical requests comprising full stripe writes, if possible, and sends these flush logical requests to the mapper task to flush dirty lines, is shown. The flush task uses several pointers to accomplish its operations. One pointer referred to as F_PTR points to the current PW RAM line being examined. Another pointer referred to as CLEAN_PTR is set to a line when the flush task finds that the line is clean. The CLEAN_PTR is cleared whenever a line is dirtied in the PW RAM 71 while the flush task is operating, regardless of which line is dirtied.
Therefore, if the flush task traverses through the entire PW RAM 71 and back to the position of the CLEAN_PTR with the CLEAN_PTR still being set, i.e., no other lines were dirtied during this time, then the flush task knows that the entire PW RAM 71 contains only clean data. At this time it halts operation, negates the BAT_ON signal to turn off the battery backup, and allows other tasks to operate.
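A simplified C sketch of the F_PTR and CLEAN_PTR bookkeeping follows; the number of lines, the helper routines, and the battery control call are hypothetical, and locking, task priorities, and the waiting conditions described below are omitted for brevity.

    #include <stdbool.h>
    #include <stddef.h>

    #define PW_RAM_LINES 1024u                 /* hypothetical number of lines */

    extern bool line_is_dirty(size_t line);    /* hypothetical helpers */
    extern void flush_line(size_t line);
    extern void negate_bat_on(void);

    /* Advance F_PTR over the PW RAM, remember the first clean line seen in
     * CLEAN_PTR, and stop with the batteries turned off once a complete pass
     * returns to CLEAN_PTR with nothing having been dirtied or flushed. */
    static void flush_task(void)
    {
        size_t f_ptr = 0;
        size_t clean_ptr = PW_RAM_LINES;         /* PW_RAM_LINES means "not set" */

        for (;;) {
            if (line_is_dirty(f_ptr)) {
                flush_line(f_ptr);
                clean_ptr = PW_RAM_LINES;        /* cleared, as in step 683 */
            } else if (clean_ptr == PW_RAM_LINES) {
                clean_ptr = f_ptr;               /* step 614 */
            }

            f_ptr = (f_ptr + 1) % PW_RAM_LINES;  /* step 688 */

            if (f_ptr == clean_ptr) {            /* step 690: a fully clean pass */
                negate_bat_on();                 /* step 696 */
                return;                          /* step 698 */
            }
        }
    }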
In step 602, the flush task determines if a task referred to as the surface analysis task is executing. The surface analysis task operates when no other task is operating and verifies data integrity and the correctness of parity information in the drive array A. This task also corrects bad parity information. More information on this general operation can be obtained from Serial No. 556,646, entitled "Intelligent Disk Array Controller Background Surface Analysis" filed on July 20, 1990 and hereby incorporated by reference. If the surface analysis task is running in step 602, then the flush task waits in step 604 for the surface analysis task to complete.
When the surface analysis task completes in step 604, or if surface analysis was not running in step 602, then the flush task begins the operation of scanning the PW RAM 71 to flush dirty lines to the drive array A. The flush task continually operates until it completely flushes the PW RAM 71 or is interrupted by a higher priority task. In step 606, the flush task determines if the PW RAM 71 is enabled and if posting is enabled. If not, the flush task continues performing step 602 until posting is enabled by another task. If so, then in step 608, the flush task determines if the respective line that it is examining is marked as currently being flushed.
If the line being examined is currently being flushed in step 608, then the flush task clears the CLEAN_PTR in step 610. CLEAN_PTR is cleared because this line has dirty data that has not yet been flushed. Also, the line may have more dirty bits that were not marked as flushed. If the line is not marked as being flushed in step 608, then in step 612 the flush task determines if the line is dirty, i.e., if the line contains dirty data. If the line is clean in step 612, then in step 614 the CLEAN_PTR is set equal to this line if CLEAN_PTR is not currently set to any line. If CLEAN_PTR is set for a prior line, then CLEAN_PTR is not updated to point to this new line. Thus, here the CLEAN_PTR serves as a marker to a clean line, and if the flush task traverses the entire PW RAM 71, and returns to this line and the CLEAN_PTR is still set at this line, then the flush task knows that the entire PW RAM 71 contains only clean data. Upon completion of setting the CLEAN_PTR in step 614, the flush task advances to step 684 (Fig. 9D).
I~ the line contains dirty data in step 612, then in step 616 the ~lush task deter~ines if the respective logical volume wher~ th~ dirty line would reside is functioning properly. If the logical volume is not functioning properly in step 616, then in step 618 flags are set indicating that write operations should ~0 no longer be po~ted to this logical volume, and posting is also disabled tc this line in step 619. Also, the line is marked clean in st~p 619 to prevent further examination o~ this line and a permanent dirty flag i5 set to indicate that the line is dirty. This prevents other data from overwritinq this data in the PW RAM 71.
The permanent dirty flag is cleared upon a system reset, at w~ic~ time another attempt may be made to flush the dirty data back to the PW RAM 71. The task then advances to step 684 (Fig. 3D).
If the logical volume is operating properly in step 616, then in step 620 the flush task determines if the respective logical volume is utilizing a parity scheme, either a distributed parity scheme such as RAID level 5 or a simple parity scheme such as RAID level 4. The flush task determines if the logical volume is implementing parity because, if so, it is desirable that the flush task generate a flush logical request at the beginning of a stripe in order to perform a full stripe write. This obviates the necessity of having to do preceding read operations to discover the data or parity information on the unwritten sectors on the disk prior to the write, which would be required in a partial stripe write operation. If the logical volume is not using a parity scheme, then the flush task advances to step 630.

If the logical volume is using parity in step 620, then in step 622 the flush task determines if the current line being examined is at the beginning of its respective stripe. If the line is at the beginning of its respective stripe, then the flush task advances to step 630 (Fig. 9B). If the line being examined is not at the beginning of its respective stripe in step 622, then in step 624 (Fig. 9B) the flush task attempts to find the respective line at the beginning of the stripe. In step 626 the flush task determines if the line at the beginning of the stripe is currently being flushed, is clean, or is absent. This determination is similar to the determination previously made in steps 608 and 612, and is made for a similar reason, which is to determine if the line has dirty data that can be flushed. If the line is currently being flushed, is clean, or is absent, then the flush task advances to step 630 and resumes examining the original line which was found to be in the middle of the stripe. This is because the line at the beginning of the stripe cannot be included in a logical request to be flushed if it is currently being flushed, or is clean, or if the respective line is not present. If the first line in the stripe is present in the PW RAM 71 and is determined to be dirty and not being flushed in step 626, then in step 628 the flush task backs up to this line to begin coalescing lines at the beginning of the respective stripe. Here the flush task changes from the line it was previously examining to the line at the beginning of the stripe, and this line is now examined.
In step 630, the flush task locks the respective line. Here, the flush task sets a bit in a status word associated with the line indicating that the line has been locked for flush purposes, i.e., the line will soon be flushed. Locking the line prevents the host from accessing this line and also prevents new data from being written to this line. In step 632, the flush task determines if the respective line that has been locked to be flushed has been written to, i.e., if it is waiting for data that has been posted to the PW RAM 71. If so, then in step 634 the flush task waits for this data to arrive at the line. In this situation, data corruption problems may result if a line that has been written to and that is waiting for data is included in a flush logical request. If this were to occur, new data may enter into the respective line of the PW RAM 71 that would be marked dirty, and if the flush subsequently completed at a later time, this otherwise dirty data would erroneously be marked clean, thus locking this data in the PW RAM 71 and causing possible erroneous operation.
After the flush task has waited in step 634, or if the line was not waiting for data in step 632, then in step 636 the flush task checks the line to determine which sectors in the line are dirty. In step 638 the flush task determines if the last sector in the respective line is dirty. If the last sector in the line being examined is dirty, then the flush task may be able to coalesce dirty sectors from the subsequent line, or a plurality of subsequent dirty lines, to possibly form a full stripe write or greater. If the last sector in the line being examined is dirty in step 638, then the flush task attempts to retrieve a pointer to the next line in step 640, provided that the next line resides in the PW RAM 71. The flush task examines the next line in the same stripe, or if the line currently being examined is at the end of a stripe, it examines the first line in a subsequent stripe, to determine if the subsequent line contains dirty sectors. After attempting to retrieve the pointer to the next line in step 640, the flush task determines if the next line is present and is both dirty and not currently being flushed in step 650 (Fig. 9C). This determination is similar to the determination previously made in step 626 (Fig. 9B) and the determinations made in steps 608 and 612. If the line is present and is either clean or currently being flushed in step 650, then the flush task disables checking for further dirty lines and advances to step 680.
If the line is present and is both dirty and not currently being flushed in step 650, then in step 652 the task locks this line, setting the appropriate status bits. In step 654, the task determines if the line is waiting for data, i.e., if the line has recently been written to. This step is similar to step 632 (Fig. 9B), previously described. If the line is waiting for data, then the task waits in step 656 for this data to arrive. After waiting in step 656, or if the line had not been written to and was not waiting for data in step 654, then in step 658 the task determines if the first sector in the respective line is dirty. If the first sector in the line is not dirty in step 658, then in step 670 the task disables locking of this line for flush purposes, discontinues checking for further dirty lines to coalesce into this logical request, and advances to step 680.
If the first sector in the line is dirty in step 658, then in step 672 the flush task adds the number of dirty sectors into the logical request that is to be created. In step 674 the task determines if the entire line being examined is dirty. If not, the task discontinues searching for further dirty lines to coalesce into this logical request and advances to step 680. If the entire line is dirty in step 674, then the task increments to the next line in step 676 and then returns to step 650 to continue looking for further lines to be included or coalesced into this logical request. This process continues until the portion of the task from step 650 to step 676 determines that a line is not present, or finds a line with one or more clean sectors, i.e., the first sector clean in step 658, or any other sectors clean in step 674, or if the task finds a line that is being flushed or is entirely clean in step 650. At this point, the task discontinues looking for further dirty lines to coalesce because no more lines can be contiguously included into the logical request.
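A hedged C sketch of the coalescing loop of steps 638 through 676 follows; the line accessors are hypothetical, and locking, waiting for posted data, and stripe boundaries are ignored for brevity.

    #include <stdbool.h>
    #include <stddef.h>

    #define SECTORS_PER_LINE 16u

    /* Hypothetical accessors into the PW RAM line status information. */
    extern bool line_present(size_t line);
    extern bool line_being_flushed(size_t line);
    extern bool sector_dirty(size_t line, unsigned sector);

    /* Count how many contiguous lines, starting at a dirty line, can be
     * coalesced into one flush logical request. */
    static size_t coalesce_span(size_t line, size_t total_lines)
    {
        size_t n = 1;

        /* Step 638: continue only while the last sector of the line is dirty. */
        while (sector_dirty(line, SECTORS_PER_LINE - 1)) {
            size_t next = line + 1;
            unsigned s;
            bool all_dirty = true;

            if (next >= total_lines || !line_present(next) ||
                line_being_flushed(next) || !sector_dirty(next, 0))
                break;                         /* steps 650 and 658 */

            n++;                               /* step 672: include next line */

            for (s = 0; s < SECTORS_PER_LINE; s++)
                if (!sector_dirty(next, s))
                    all_dirty = false;

            if (!all_dirty)
                break;                         /* step 674: partial line ends span */

            line = next;                       /* step 676 */
        }
        return n;
    }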
Having found one or more contiguous lines which can be coalesced together, the flush task now assembles a logical request for these lines in step 680 so that all of these lines may be flushed back to the drive array A as one logical request. The flush task assembles a logical request from all of the lines found to be contiguous and dirty in steps 612-676. In step 682 the flush task sends this logical request to the mapper task previously described. As previously discussed, the mapper task examines the logical request created by the flush task, marks the request, and splits up the logical request into a queue of request lists. The mapper task in turn calls the scheduler task, which as previously discussed marks the requests in each request list as to its type and destination, and splits up the requests into individual drive queues.
These requests are then executed by the dequeue task and the post processor task, discussed below, along with host generated requests, to actually transfer the data between the host, PW RAM 71, and drive array A, as required.


In step 683, the flush task clears CLEAN_PTR. This provides the mapper, scheduler and post processor tasks with sufficient time to actually flush the dirty data to the drive array A. This prevents the PW RAM 71 from potentially being marked completely clean while dirty data still resides in the PW RAM 71 waiting to be flushed. Also, as previously discussed, CLEAN_PTR is cleared in step 610 if the flush task examines a line that has been marked as being flushed but the flush has not yet been completed.
In step 684 (Fig. 9D), the flush task determines if any new lines have been dirtied since the last time this check was made. If so, then in step 686, the task clears the CLEAN_PTR. The flush task also clears a flag referred to as DIRTY_LINES in step 686. The DIRTY_LINES flag is set when any line in the PW RAM 71 is subsequently dirtied. By clearing this flag, the flush task can subsequently detect when new lines are dirtied the next time step 684 is executed.
After clearing the CLEAN_PTR in step 686, or if more lines were not dirtied in step 684, then in step 688 the task increments to the next line. Here it is noted that the next line may be the next portion of a stripe on a subsequent drive in the drive array A or it may be a portion of a new stripe. In step 690 the task determines if it has traversed through the entire PW RAM 71 without any further lines being dirtied and has returned back to where CLEAN_PTR was set. If so, this indicates that the entire PW RAM 71 has been flushed and therefore contains no more dirty lines. If this occurs, then in step 692 the task sets a flag to indicate that the flush task is no longer running. In step 696 the flush task turns off the backup batteries to the PW RAM 71 by negating the BAT_ON signal. Since the PW RAM 71 contains no more dirty lines, batteries are no longer required for data protection. The flush task then terminates operations and enables surface analysis operations to resume in step 698.
If the task has not returned to where CLEAN_PTR was set in step 690, i.e., has not traversed through the entire PW RAM 71 and returned to the CLEAN_PTR line without any further lines being dirtied, then in step 700 the task determines if it has traversed through the entire PW RAM 71 without flushing anything. This situation is intended for instances where the task has searched through the entire PW RAM 71 without flushing anything but yet some lines in the PW RAM 71 are locked for flush purposes, indicating that a flush is to be performed for these lines. This occurs when for some reason it is taking an unusually long time to flush dirty lines. If the condition in step 700 is true, then the task waits in step 702 and allows other lower priority tasks to run. Upon completion of the wait period in step 702, or if the task has not been through the entire PW RAM 71 without flushing in step 700, then the task returns to step 606 to examine the next line in the PW RAM 71. This process repeats until the flush task is interrupted or flushes the entire PW RAM 71.

Referring now to Figure 10, once the scheduler task has initiated transfer of the request at each of the individual drive queues, later requests in each of the queues are initiated by a task referred to as the dequeue task. In step 952, the dequeue task examines a request at the head of the respective drive queue being examined. In step 954, the task determines if a DMA channel in the transfer controller 44 is ready for the request. If not, then in step 956 the dequeue task determines if the request is a parity request. If so, then the task places the request in a channel queue referred to as the parity channel queue in step 958. If the request is not a parity request, then the task places the request in a queue referred to as the first-come-first-serve (fcfs) channel queue in step 960. The task then advances to step 964. If a DMA channel is ready in step 954, then the dequeue task initiates the transfer in step 962. Upon completion of the transfer, the post processor task is invoked to perform various post processing, as described further below. After initiating the transfer in step 962, or upon completion of either of steps 958 or 960, the dequeue task determines if the transfer was a host read performed from the drive array A. If so, then the task initiates a transfer to the host in step 968. Upon completion of step 966, or if the request was not a host read, the dequeue task completes. The dequeue task is invoked again when other requests are ready to be executed at the head of any of the drive queues.
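The channel selection of steps 952 through 960 may be sketched in C as follows; the queue routines and the DMA availability test are hypothetical placeholders.

    #include <stdbool.h>

    struct request;                                   /* opaque drive request */

    extern bool dma_channel_ready(void);              /* hypothetical helpers */
    extern bool is_parity_request(const struct request *r);
    extern void start_dma_transfer(struct request *r);
    extern void enqueue_parity_channel(struct request *r);
    extern void enqueue_fcfs_channel(struct request *r);

    /* Steps 952-960 sketch: start the transfer if a DMA channel is free,
     * otherwise park the request on the parity or first-come-first-serve
     * channel queue for the post processor task to restart later. */
    static void dequeue_head_request(struct request *r)
    {
        if (dma_channel_ready()) {
            start_dma_transfer(r);           /* step 962 */
        } else if (is_parity_request(r)) {
            enqueue_parity_channel(r);       /* step 958 */
        } else {
            enqueue_fcfs_channel(r);         /* step 960 */
        }
    }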

When a data transfer completes, the post processor task is invoked to perform various post processing operations. When the post processor task completes post processing a task, it searches through the parity channel queue and the fcfs queue to determine if either of these queues contain any requests. If so, the post processor task initiates a transfer of the request at the head of the respective queue. If the parity channel queue and the fcfs queue do not contain any requests, then the post processor task completes, and the dequeue task resumes operation and initiates transfer of subsequent requests in the respective drive queues as described above.


Referring now to Figures 11A-G, a flowchart diagram illustrating operation of the post processor task is shown. When a request completes in step 740, then in step 742 the task determines if the operation was a PW RAM operation. If so, the post processor task performs various post-processing operations, depending on what type of request or operation was performed. In step 746 the task advances to the next operation state of the respective request. Each request includes one or more operation states. For example, a PW RAM read only includes one state, a data transfer from the PW RAM to the host. A PW RAM flush includes three states, these being a transfer from the PW RAM 71 to the controller transfer buffer RAM 46, a write from the controller transfer buffer RAM 46 to the drive array A, and a status update from the processor 30 to the PW RAM 71. A posted write includes two operation states, these being the data transfer from the processor 30 to the PW RAM 71 and the status update to the PW RAM 71.
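For illustration only, the operation states described above might be encoded as follows; the enumerators and the countdown convention, with the highest state first and 0 last, are assumptions drawn from the surrounding description.

    /* Hypothetical encoding: a request starts at its highest state and counts
     * down toward 0 as each stage completes (see steps 746, 782, and 812). */
    enum pw_ram_read_state  { PW_READ_DATA_TO_HOST = 0 };

    enum pw_ram_flush_state {
        FLUSH_READ_FROM_PW_RAM = 2,   /* PW RAM 71 -> transfer buffer RAM 46   */
        FLUSH_WRITE_TO_ARRAY   = 1,   /* transfer buffer RAM 46 -> drive array */
        FLUSH_STATUS_UPDATE    = 0    /* processor 30 -> PW RAM 71 status      */
    };

    enum posted_write_state {
        POST_DATA_TO_PW_RAM    = 1,   /* processor 30 -> PW RAM 71             */
        POST_STATUS_UPDATE     = 0    /* status update to the PW RAM 71        */
    };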
In step 748, the task determines the type of operation or request that was performed. If the operation was a configuration update in step 748, then in step 752 the task marks the PW RAM 71 inactive. A configuration update involves writing a new configuration or identification signature to the PW RAM 71. The PW RAM 71 is marked inactive because a configuration update only has one state, which will already have completed by step 750. Upon completion of step 752, the task advances to step 910.
If the operation is a read hit in step 748, then in step 762 (Fig. 11B) the task determines if a read error occurred to the PW RAM 71 on the read, the read having previously completed before this task was invoked. If so, then in step 764 the task logs the error and in step 766 the task determines if the read operation was a read from the mirrored portion of the PW RAM 71. If so, then the task generates a fatal error signal in step 768, which disables posting. If the read was not from the mirrored portion of the PW RAM 71 in step 766, then in step 770 the task returns to the previous operation state, which in this instance is the read operation. In step 772 the task marks the request as a mirrored read and in step 774 initiates the request. This request generates a read from the mirrored portion of the PW RAM 71. Upon completion of step 774, the task returns to step 740. It is noted here that the task will again advance through steps 740-748, and the operation will again be determined to be a read hit in step 748. If this read generates an error as determined in step 762, then in step 766 the determination of whether the read was a mirrored read will be true and the task will advance to step 768 and generate a fatal error.
If a read error was determined not to have occurred in step 762, then in step 776 the task disables locking of the respective line and in step 778 the task deactivates the PW RAM 71. Upon completion of step 778 or step 768, the task advances to step 910.
If the respective request is determined to be a posted write in step 748, then in step 782 (Fig. 11C) the task determines the operation state of the posted write. If the posted write is in operation state 1, meaning that the write has just completed, then in step 786 the task determines if the write operation involved an error. If so, then the PW RAM 71 is marked inactive in step 788, and in step 790 the error is logged and the volume is marked as failed. If an error did not occur in step 786, then a status update to the PW RAM 71 is performed, as described below, which includes marking the respective sectors involved in the write both valid and dirty.
In step 882 (Fig. 11D), the task determines if the line has already been dirtied, i.e., if the line already contains dirty sectors. If not, then in step 884, the task increments a counter which counts the number of dirty lines in the respective set. A set comprises the corresponding lines across each of the 15 ways of the PW RAM 71, and thus a set comprises 15 lines, one from each of the respective ways of the PW RAM 71. If the number of dirty lines counted by the counter is greater than or equal to a certain threshold in step 886, then the task advances to step 888. In the preferred embodiment, the threshold is set equal to full, or 15 lines.
In step 888, the task increments a counter for the number of full sets in the PW RAM 71. In step 890, the task disables write posting to the PW RAM. This is done because, when one set in the PW RAM is completely filled, it is determined that there are a large number of dirty lines in the PW RAM 71 which require flushing, and thus further writes to the PW RAM 71 are disabled.
Upon completion of step 890, or if the dirty line counter was not greater than or equal to the threshold in step 886, or if the line had already been dirtied in step 882, then in step 892 the task marks the sectors in the line as dirty and valid. The task also unlocks the line to indicate that the posted write has completed.
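The bookkeeping of steps 882 through 890 is sketched below; the counter structure and the threshold constant mirror the 15-way organization described above, but the names and the code itself are hypothetical.

    #include <stdbool.h>

    #define WAYS_PER_SET   15u    /* lines per set, as described above    */
    #define FULL_THRESHOLD 15u    /* "full", per the preferred embodiment */

    struct pw_set_stats {
        unsigned dirty_lines;     /* dirty lines currently in this set    */
    };

    static unsigned full_sets;               /* number of completely dirty sets */
    static bool     posting_enabled = true;

    /* Steps 882-890 sketch: called when a posted write dirties a line that was
     * previously clean; posting is disabled once any set becomes full. */
    static void account_newly_dirty_line(struct pw_set_stats *set)
    {
        set->dirty_lines++;                        /* step 884 */

        if (set->dirty_lines >= FULL_THRESHOLD) {  /* step 886 */
            full_sets++;                           /* step 888 */
            posting_enabled = false;               /* step 890 */
        }
    }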
In step 894, the task sets the dirty lines flag. As previously discussed, the dirty lines flag is used by the flush task to determine if more lines have been dirtied in the PW RAM 71. In step 896, the task enables the flush task at the current line if the flush task is not already running. In step 898, the task marks the request as a status update and in step 900 the task initiates transfer of the respective status information to the PW RAM 71. Upon completion of step 900, the task returns to step 740 and waits for the respective operation to complete. Here it is noted that after the drive operation completes the task will advance to the next operation state in step 746 and then the operation state will be 0 in step 782.
Referring again to Figure 11C, if the operation state is 0 in step 782, meaning that the status information has already been updated, then the task advances to step 800. In step 800, the task marks the PW RAM 71 inactive and advances to step 910 to initiate other requests. It is noted that if the operation state was determined to be 1 in step 782 and the task performs the operation of transmitting the status in step 900 and then returns to step 740, the next traversal through this sequence results in the task determining the operation state to be 0 in step 782.
Referring again to Fig. 11A, if the current request is determined to be a PW RAM flush request in step 748, then in step 812 (Figure 11E) the task determines the operation state of the flush. If the operation state is determined to be 2 in step 812, then in step 816 (Figure 11F) the task determines if a read error occurred in the processor read from the PW RAM 71. It is noted that in a PW RAM flush operation, the processor 30 will read the data from the PW RAM 71, generate any data guard information, i.e., parity writes or mirrored writes, if applicable, and then write this data to the drive array A.
If a read error did occur in reading the flush data from the PW RAM 71 in step 816, then in step 818, the task logs the error and in step 820 determines if the read occurred from the mirrored portion of the PW RAM 71. If so, then a fatal error is generated in step 822, write posting is disabled, and the task then advances to step 910. If the read was not from the mirrored portion of the PW RAM 71 in step 820, then in step 824 the task restores the previous operation state and marks the request as a mirrored request in step 826. In step 828 the task initiates a read request from the mirrored side of the PW RAM 71 and returns to step 740.
If a read error did not occur in the PW RAM 71 in step 816, then in step 830 the task marks the PW RAM 71 inactive. In step 832 the task restores the original command or request to a write operation. As previously discussed, in step 384 of the scheduler task, when the scheduler task had determined that the request was a flush request, it changed the request from a write to a read in order to retrieve the data from the PW RAM 71. In step 832, the post processor task restores the request to a write to now write the flush data to the drive array A.
In step 834 the task determines if the request requires data guard operations. If so, then in step 836, the task frees the channel for the upcoming data guard operations and in step 838 frees up the destination drive. In step 840 the task initiates the respective data guard operation required as well as the data writes and then advances to step 910. If the PW RAM flush operation does not include data guard requests in step 834, then in step 842 the task writes the data retrieved from the PW RAM 71 to the drive array A and returns to step 740. When the operation completes in step 740, then the task advances to the next operation state in step 746, and thus the operation state will be 1 the next time step 812 is executed.
Referring again to Figure 11E, if the flush operation state is determined to be 1 in step 812, then in step 852 the task determines if a drive error occurred on the write operation to the drive array A.
If not, then in step 854 the task marks the respective sectors clean and marks the request as a status update in step 856. In step 858 the task determines if the PW RAM 71 is inactive. If so, then the task initiates the status transfer in step 860 and returns to step 740. The next time step 812 is executed the operation state will be 0.
If the PW RAM 71 is determined to be active in step 858, then in step 862 the task releases the DMA channel and in step 864 sets the destination of the request to the PW RAM 71. In step 866 the task requeues the request to the fcfs queue and then advances to step 910 (Fig. 11G). If a drive error was determined to have occurred on the write operation in step 852, then the task advances to step 910. No attempt is made to fix the drive error in this task, but rather remap operations that are performed on the drive array A are designated to fix this occurrence.
If the operation state in step 812 is determined to be 0, then in step 874 the task marks the PW RAM 71 inactive. In step 876 the task unlocks the line to indicate that the flush has completed.
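Taken together, steps 812 through 876 form a small per-request state machine for a flush: one pass handles the completed read of the posted data, the next the completed write to the drive array, and the last the status update, at which point the PW RAM is marked inactive and the line unlocked. The sketch below models that sequence; the attribute names and the duck-typed pw_ram and drives objects are assumptions.

# Illustrative per-request state machine for a PW RAM flush (steps 812-876).
# State 2: the PW RAM read has completed; state 1: the drive write has
# completed; state 0: the status update has completed.  Names are assumed.
def post_process_flush(req, pw_ram, drives):
    """Invoked each time the current operation for the flush request completes."""
    if req.state == 2:                            # step 812, then Figure 11F path
        req.command = "write"                     # step 832: restore to a write
        drives.write(req.destination, req.data)   # step 842 (840 with data guard)
        req.state = 1
    elif req.state == 1:                          # the array write completed
        pw_ram.mark_clean(req.sectors)            # step 854
        pw_ram.write_status(req.sectors)          # steps 856-860: status update
        req.state = 0
    else:                                         # state 0: status update done
        pw_ram.active = False                     # step 874
        pw_ram.unlock(req.line)                   # step 876: flush has completed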
Referring now to Figure 11G, in step 910 the task determines if the operation has completed. If so, then in step 912 the task frees up the respective channel and sends a completion message in step 914. It is noted that the task will determine in step 910 that the operation has not completed if the task arrives here from step 866. This is because the PW RAM 71 was busy in step 858 and thus the request was requeued in step 866. Upon completion of step 914, or if the operation had not completed in step 910, in step 916 the task scans the parity channel queue for requests and initiates a transfer if the queue contains requests.
In step 918 the task scans the fcfs queue for requests and initiates a transfer if the queue contains requests. When either of these requests completes, then the post processor task is again invoked to perform any post processing required. If the parity channel queue and the drive queue do not contain any requests, then the post processor task completes and the dequeue task is invoked. The dequeue task initiates other transfers from requests at the head of the respective drive queues. When these requests complete, the post processor task is invoked to perform post processing as described above. This sequence of operations thus performs the various data, parity, and status transfers between the PW RAM 71, the host, and the drive array A.
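The closing sequence amounts to a simple scheduling loop: finish the completed request, then start whatever is waiting on the parity channel queue and the fcfs queue, and fall back to the dequeue task's drive queues when both are empty. A compact sketch of that order follows; the queue and helper names are assumptions.

# Compact sketch of the scheduling order in steps 910-918 plus the dequeue
# task.  Queue and helper names are illustrative assumptions.
def post_process_complete(completed, release_channel, parity_queue, fcfs_queue,
                          drive_queues, start_transfer, send_completion):
    if completed:                         # step 910: did the operation finish?
        release_channel()                 # step 912
        send_completion()                 # step 914
    started = False
    if parity_queue:                      # step 916: parity channel queue first
        start_transfer(parity_queue.pop(0))
        started = True
    if fcfs_queue:                        # step 918: then the fcfs queue
        start_transfer(fcfs_queue.pop(0))
        started = True
    if not started:
        # Post processor done; the dequeue task starts the request at the
        # head of each drive queue instead.
        for q in drive_queues:
            if q:
                start_transfer(q.pop(0))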

Therefore, a posted write memory which operates in conjunction with a drive array is disclosed. A task referred to as the flush task continually scans the PW RAM 71, coalesces partial stripe writes into full stripe writes, and builds logical requests in a way similar to the manner in which a host such as the computer processor builds a command list. The flush logical request is executed in a similar manner to the logical requests in a host command list. In addition, if the PW RAM 71 becomes full, it delays partial stripe writes, but allows full stripe writes or greater to pass directly to the drive array A, thus increasing system efficiency.
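A stripped-down sketch of that coalescing idea: walk the PW RAM lines in order, gather runs of contiguous flushable lines, and turn each run into a single write request, so that adjacent partial stripe writes come out as full stripe writes. The line geometry and names are assumptions, not the flush task itself.

# Stripped-down sketch of coalescing contiguous dirty PW RAM lines into
# single write requests.  Geometry and names are illustrative assumptions.
LINES_PER_STRIPE = 4

def build_flush_requests(dirty):
    """dirty: one boolean per PW RAM line, True if the line can be flushed.
    Returns a list of (first_line, line_count) write requests."""
    requests, run_start = [], None
    for line, flushable in enumerate(dirty):
        if flushable and run_start is None:
            run_start = line                        # begin a run of dirty lines
        elif not flushable and run_start is not None:
            requests.append((run_start, line - run_start))
            run_start = None
    if run_start is not None:
        requests.append((run_start, len(dirty) - run_start))
    return requests

if __name__ == "__main__":
    # Four adjacent partially written lines coalesce into one request that
    # covers a full stripe of LINES_PER_STRIPE lines; the isolated dirty line
    # remains a partial stripe write.
    print(build_flush_requests([True, True, True, True, False, True, False, False]))
    # -> [(0, 4), (5, 1)]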
The foregoing disclosure and description of the invention are illustrative and explanatory thereof, and various changes in the size, shape, materials, components, circuit elements, and wiring connections, as well as in the details of the illustrated circuitry and construction and method of operation may be made without departing from the spirit of the invention.

Claims (21)

1. An apparatus for improving performance in a disk array, wherein the disk array includes parity fault tolerance and a plurality of stripes for storing data, the apparatus comprising:
a posting memory coupled to the disk array which receives and stores host write data intended for the disk array, wherein said host writes comprise partial stripe writes;
flushing means coupled to said posting memory for monitoring said host write data stored in said posting memory and for generating write requests comprising said host write data including one or more full stripe write operations; and means coupled to said flushing means, said posting memory, and the disk array for receiving said write request generated by the flushing means and for executing said write request to the disk array.
2. The apparatus of claim 1, wherein said flushing means coalesces contiguous partial stripe writes and incorporates said coalesced partial stripe writes into said write request.
3. The apparatus of claim 2, wherein said posting memory is organized into a plurality of sequential lines storing host write data; and wherein said flushing means sequentially examines said lines storing host write data and incorporates sequential lines including dirty data that can be flushed into said write request.
4. The apparatus of claim 3, wherein the drive array is comprised of one or more logical volumes and host write data has a destination in one of said logical volumes, wherein groups of said posting memory lines correspond to respective stripes in the disk array, each of said groups including a first line corresponding to the beginning of its corresponding stripe in the disk array and a last line corresponding to the end of its corresponding stripe in the disk array;
wherein said flushing means determines the logical volume destination of host write data in said posting memory prior to examining said lines; and wherein said flushing means begins examining said lines at the first line in a group corresponding to a respective stripe if said logical volume destination includes parity fault tolerance.
5. The apparatus of claim 4, wherein if said flushing means begins examining a line other than the first line in said group corresponding to said respective stripe, said flushing means backs up and begins examining the first line in said group if said logical volume destination includes parity fault tolerance.
6. The apparatus of claim 3, wherein groups of said posting memory lines correspond to respective stripes in the disk array;
wherein said flushing means examines sequential lines and incorporates said sequential lines into said write request until said flushing means examines a line which does not include data that can be flushed.
7. The apparatus of claim 1, further comprising:
means for storing status information pertaining to each of said lines in said posting memory;

wherein said flushing means is coupled to said status information storing means and examines said status information pertaining to a respective line to determine whether to incorporate said respective line into said write request.
8. The apparatus of claim 1, wherein said flushing means includes means for determining if said posting memory includes no dirty data; and wherein said flushing means discontinues operations when said determining means determines that said posting memory includes no dirty data.
9. The apparatus of claim 8, further comprising:
a main power source coupled to said posting memory; and battery back-up means coupled to said posting memory which supplies power to said posting memory if said main power source discontinues providing power;
and wherein said flushing means disables operation of said battery back-up means if said determining means determines that said posting memory includes no dirty data.
10. A method for improving disk array performance utilizing a posting memory coupled to the disk array, wherein the disk array includes parity fault tolerance and a plurality of stripes for storing data, the method comprising:
receiving host write operations, wherein said host write operations include partial stripe writes;
storing said host write operations in the posting memory;

coalescing contiguous partial stripe writes into full stripes or greater; and writing said coalesced partial stripe writes to the drive array as one or more full stripe writes.
11. The method of claim 10, wherein the posting memory includes a plurality of groups of lines storing data corresponding to respective stripes in the disk array, each of said lines storing data corresponding to a plurality of sectors on a drive in the drive array including a first sector and a last sector, wherein said step of coalescing comprises:
a) examining a line corresponding to a first stripe to determine if the line includes dirty data that can be flushed;
b) determining if the last sector in said line being examined contains dirty data that can be flushed;
c) advancing to step h) if the last sector in said line being examined does not include dirty data that can be flushed in step b);
d) examining the line subsequent to said line being examined if the last sector in said line being examined includes dirty data that can be flushed in step b) and said subsequent line is present, said subsequent line now becoming the line being examined;
e) determining if the first sector in said line being examined contains dirty data that can be flushed if said line being examined is present;
f) coalescing the dirty sectors in said line being examined with prior examined lines if the first sector in said line being examined includes dirty data that can be flushed and said line being examined is present;
g) examining the line subsequent to said line being examined and returning to step (e) if said line being examined is present and all of the data in said line being examined includes dirty data that can be flushed and said subsequent line is present, said subsequent line now becoming the line being examined;
and h) assembling a write request comprising the dirty sectors of said examined lines; and wherein said step of writing comprises executing said write request to the drive array.
12. A method for flushing data from a posting memory to a disk array which includes a plurality of stripes storing data and parity fault tolerance, wherein the posting memory includes a plurality of groups of lines storing data corresponding to respective stripes in the disk array, each of said lines including data corresponding to a plurality of sectors on a drive in the drive array, each of said lines including data corresponding to a first sector and a last sector, the method comprising:
a) examining a line corresponding to a first stripe to determine if the line includes dirty data that can be flushed;
b) determining if the last sector in said line being examined contains dirty data that can be flushed;
c) advancing to step h) if the last sector in said line being examined does not include dirty data that can be flushed in step b);
d) examining the line subsequent to said line being examined if the last sector in said line being examined includes dirty data that can be flushed in step b) and said subsequent line is present, said subsequent line now becoming the line being examined; e) determining if the first sector in said line being examined contains dirty data that can be flushed if said line being examined is present;
f) coalescing the dirty sectors in said line being examined with prior examined lines if the first sector in said line being examined includes dirty data that can be flushed and said line being examined is present;
g) examining the line subsequent to said line being examined and returning to step (e) if said line being examined is present and all of the data in said line being examined includes dirty data that can be flushed and said subsequent line is present, said subsequent line now becoming the line being examined;
and h) assembling a write request comprising the dirty sectors of said examined lines.
13. A method for flushing data from a posting memory to a disk array which includes a plurality of stripes storing data and parity fault tolerance, wherein each of the stripes includes a plurality of sectors and wherein the posting memory includes a plurality of groups of lines storing data corresponding to stripes in the disk array, wherein each of said groups may store a first line corresponding to the beginning of its corresponding stripe in the disk array and a last line corresponding to the end of its corresponding stripe in the disk array, each of said lines corresponding to a plurality of sectors on a drive in the drive array including a first sector and a last sector, the method comprising:
a) examining a middle line corresponding to a stripe to determine if the line contains dirty data that can be flushed;

b) determining if parity fault tolerance is being implemented on the volume including said stripe;
c) locating the first line corresponding to the beginning of said stripe if parity fault tolerance is being implemented on the volume including said stripe, said middle line being examined is not said first line, and said first line is present;
d) determining if said first line corresponding to the beginning of said stripe includes dirty data that can be flushed after said step of locating if said first line is located in step c);
e) coalescing dirty sectors in said first line as well as subsequent lines into a write request and advancing to step (i) if said first line is present and includes dirty data that can be flushed;
f) returning to said middle line if said first line is not present or does not include dirty data that can be flushed;
g) examining said middle line after said step of returning if said first line is not present or does not include dirty data that can be flushed;
h) coalescing dirty sectors in said middle line as well as subsequent lines into a write request after executing steps f) and g); and i) generating a write request comprising said coalesced dirty sectors.
14. An apparatus for improving performance in a disk array which includes a plurality of stripes for storing data and parity fault tolerance, wherein writes intended for the disk array include corresponding write data having a given size, the apparatus comprising:
a disk array;
a posting memory receiving and storing said disk array write data;

means for delaying storing of write data corresponding to a disk array write if said posting memory is full and said write data is less than a first size; and means for writing disk array write data directly to the drive array if said posting memory is full and said write data is greater than or equal to said first size; and wherein said write data greater than or equal to said first size is not written to said posting memory.
15. The apparatus of claim 14, wherein said first size is equivalent to a full stripe write if the drive array is using parity fault tolerance.
16. The apparatus of claim 15, further comprising:
means for delaying storing write data corresponding to a disk array write if said posting memory is full and said write data is less than a second size if the disk array is not using parity fault tolerance;
means for writing disk array write data directly to the drive array if the posting memory is full, the disk array is not using parity fault tolerance, and said write data is greater than or equal to said second size;
wherein said second size is less than said first size.
17. A method for improving performance in a disk array which includes a plurality of stripes for storing data and parity fault tolerance, wherein writes intended for the disk array include corresponding write data having a given size, the method comprising:

receiving a disk array write request having a certain size and a destination in the disk array;
determining if the posting memory is full;
determining if parity fault tolerance is being used in said disk array destination after said step of receiving;
delaying storing write data corresponding to said disk array write request if said posting memory is full, parity fault tolerance is being used in said disk array destination, and said write data is less than a first size; and writing said write data directly to the drive array if said posting memory is full, parity fault tolerance is being used in said disk array destination, and said write data is greater than or equal to said first size.
18. The method of claim 17, further comprising:
delaying storing said write data if said posting memory is full, parity fault tolerance is not being used in said disk array destination, and said write data is less than a second size; and writing said write data directly to the drive array if said posting memory is full, parity fault tolerance is not being used in said disk array destination, and said write data is greater than or equal to said second size;
wherein said second size is less than said first size.
19. A method for improving performance in a disk array which includes a plurality of stripes for storing data and parity fault tolerance, wherein writes intended for the disk array include corresponding write data having a given size, the method comprising:

delaying storing write data corresponding to a disk array write if said posting memory is full and said write data is less than a first size;
writing directly to the drive array write data corresponding to a disk array write if said posting memory is full and said write data is greater than or equal to said first size.
20. The method of claim 19, wherein said first size is equivalent to a full stripe write if the drive array is using parity fault tolerance.
21. The method of claim 20, wherein said step of delaying delays storing write data corresponding to a disk array write if said posting memory is full and said write data is less than a second size if the disk array is not using parity fault tolerance;
wherein said step of directly writing writes directly to the drive array write data corresponding to a disk array write if said posting memory is full and said write data is greater than or equal to said second size if the disk array is not using parity fault tolerance; and wherein said second size is less than said first size.
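Claims 14 through 21 above describe a simple admission policy for the case where the posting memory is full: small writes are delayed, writes at or above a size threshold bypass the posting memory and go straight to the array, and a smaller threshold applies when the destination volume has no parity fault tolerance. A minimal sketch of that decision follows, with assumed names and byte values.

# Minimal sketch of the admission policy in claims 14-21: when the posting
# memory is full, delay small writes but send large writes directly to the
# drive array.  The threshold byte values and names are assumptions.
FULL_STRIPE = 64 * 1024       # "first size": a full stripe on parity volumes
SECOND_SIZE = 16 * 1024       # smaller threshold used without parity

def route_write(size, posting_memory_full, parity_fault_tolerance):
    """Return 'post', 'delay', or 'direct' for an incoming host write."""
    if not posting_memory_full:
        return "post"                               # normal posted write
    threshold = FULL_STRIPE if parity_fault_tolerance else SECOND_SIZE
    if size >= threshold:
        return "direct"                             # bypass the posting memory
    return "delay"                                  # wait for flushing to free space

if __name__ == "__main__":
    print(route_write(128 * 1024, True, True))    # direct: full stripe or greater
    print(route_write(8 * 1024, True, True))      # delay: partial stripe write
    print(route_write(32 * 1024, True, False))    # direct: above the second size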
CA002097782A 1992-06-05 1993-06-04 Posted write disk array system Abandoned CA2097782A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US894,067 1992-06-05
US07/894,067 US5408644A (en) 1992-06-05 1992-06-05 Method and apparatus for improving the performance of partial stripe operations in a disk array subsystem

Publications (1)

Publication Number Publication Date
CA2097782A1 true CA2097782A1 (en) 1993-12-06

Family

ID=25402554

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002097782A Abandoned CA2097782A1 (en) 1992-06-05 1993-06-04 Posted write disk array system

Country Status (3)

Country Link
US (1) US5408644A (en)
EP (1) EP0573308A2 (en)
CA (1) CA2097782A1 (en)

Families Citing this family (113)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5781749A (en) * 1991-12-19 1998-07-14 Bull S.A. Controller for multiple data transfer between a plurality of memories and a computer bus
WO1993023803A1 (en) * 1992-05-21 1993-11-25 Fujitsu Limited Disk array apparatus
JP3183719B2 (en) * 1992-08-26 2001-07-09 三菱電機株式会社 Array type recording device
US5636356A (en) * 1992-09-09 1997-06-03 Hitachi, Ltd. Disk array with original data stored in one disk drive and duplexed data distributed and stored in different disk drives
JP3083663B2 (en) * 1992-12-08 2000-09-04 株式会社日立製作所 Disk array device
GB2273798A (en) * 1992-12-22 1994-06-29 Ibm Cache system for disk array.
US6138126A (en) * 1995-05-31 2000-10-24 Network Appliance, Inc. Method for allocating files in a file system integrated with a raid disk sub-system
US5598549A (en) * 1993-06-11 1997-01-28 At&T Global Information Solutions Company Array storage system for returning an I/O complete signal to a virtual I/O daemon that is separated from software array driver and physical device driver
US5933839A (en) * 1993-07-23 1999-08-03 Kabushiki Kaisha Toshiba Distributed file system for renewing data with high integrity
US5530948A (en) * 1993-12-30 1996-06-25 International Business Machines Corporation System and method for command queuing on raid levels 4 and 5 parity drives
JPH07210335A (en) * 1993-12-30 1995-08-11 Internatl Business Mach Corp <Ibm> Method for readout, write and change of data and storage apparatus
US5524216A (en) * 1994-05-13 1996-06-04 Hewlett-Packard Company Coherent transaction ordering in multi-tiered bus system
US5546558A (en) * 1994-06-07 1996-08-13 Hewlett-Packard Company Memory system with hierarchic disk array and memory map store for persistent storage of virtual mapping information
EP0707267A3 (en) 1994-10-12 1996-07-03 Ibm Redundant array of disk drives with asymmetric mirroring and asymmetric mirroring data processing method
JP2968181B2 (en) * 1994-11-29 1999-10-25 インターナショナル・ビジネス・マシーンズ・コーポレイション Disk device and data writing / reading method
EP0716370A3 (en) * 1994-12-06 2005-02-16 International Business Machines Corporation A disk access method for delivering multimedia and video information on demand over wide area networks
US5537534A (en) * 1995-02-10 1996-07-16 Hewlett-Packard Company Disk array having redundant storage and methods for incrementally generating redundancy as data is written to the disk array
US5708793A (en) * 1995-03-31 1998-01-13 International Business Machines Corporation Method and apparatus using address and read head location information to provide optimal operation of a disk system
US5671390A (en) * 1995-05-23 1997-09-23 International Business Machines Corporation Log structured array storage subsystem using LSA directory and LSA sub-directory stored in different storage media
AU5790596A (en) * 1995-06-07 1996-12-30 Tricord Systems, Inc. Intelligent disk-cache memory
CA2223876C (en) * 1995-06-26 2001-03-27 Novell, Inc. Apparatus and method for redundant write removal
US5594863A (en) * 1995-06-26 1997-01-14 Novell, Inc. Method and apparatus for network file recovery
US5737744A (en) * 1995-10-13 1998-04-07 Compaq Computer Corporation Disk array controller for performing exclusive or operations
US5860090A (en) * 1995-10-20 1999-01-12 Informix Software, Inc. Append-only storage in a disk array using striping and parity caching
US5778426A (en) * 1995-10-23 1998-07-07 Symbios, Inc. Methods and structure to maintain a two level cache in a RAID controller and thereby selecting a preferred posting method
US6067635A (en) * 1995-10-27 2000-05-23 Lsi Logic Corporation Preservation of data integrity in a raid storage device
US5720025A (en) * 1996-01-18 1998-02-17 Hewlett-Packard Company Frequently-redundant array of independent disks
US5724501A (en) * 1996-03-29 1998-03-03 Emc Corporation Quick recovery of write cache in a fault tolerant I/O system
US5857208A (en) * 1996-05-31 1999-01-05 Emc Corporation Method and apparatus for performing point in time backup operation in a computer system
US5781733A (en) * 1996-06-20 1998-07-14 Novell, Inc. Apparatus and method for redundant write removal
US5860091A (en) * 1996-06-28 1999-01-12 Symbios, Inc. Method and apparatus for efficient management of non-aligned I/O write request in high bandwidth raid applications
US5895488A (en) * 1997-02-24 1999-04-20 Eccs, Inc. Cache flushing methods and apparatus
US6092149A (en) * 1997-05-28 2000-07-18 Western Digital Corporation Disk drive cache system using a dynamic priority sequential stream of data segments continuously adapted according to prefetched sequential random, and repeating types of accesses
US6148368A (en) * 1997-07-31 2000-11-14 Lsi Logic Corporation Method for accelerating disk array write operations using segmented cache memory and data logging
JP3616487B2 (en) * 1997-11-21 2005-02-02 アルプス電気株式会社 Disk array device
US6105103A (en) * 1997-12-19 2000-08-15 Lsi Logic Corporation Method for mapping in dynamically addressed storage subsystems
JPH11203056A (en) 1998-01-19 1999-07-30 Fujitsu Ltd Input/output controller and array disk device
US6654881B2 (en) * 1998-06-12 2003-11-25 Microsoft Corporation Logical volume mount manager
US6366968B1 (en) 1998-06-26 2002-04-02 Intel Corporation Physical write packets processing when posted write error queue is full, with posted write error queue storing physical write requests when posted write packet fails
US6704837B2 (en) 1998-06-29 2004-03-09 International Business Machines Corporation Method and apparatus for increasing RAID write performance by maintaining a full track write counter
US6334168B1 (en) * 1999-02-19 2001-12-25 International Business Machines Corporation Method and system for updating data in a data storage system
US6298415B1 (en) * 1999-02-19 2001-10-02 International Business Machines Corporation Method and system for minimizing writes and reducing parity updates in a raid system
US6530000B1 (en) 1999-03-24 2003-03-04 Qlogic Corporation Methods and systems for arbitrating access to a disk controller buffer memory by allocating various amounts of times to different accessing units
US6467047B1 (en) * 1999-07-30 2002-10-15 Emc Corporation Computer storage system controller incorporating control store memory with primary and secondary data and parity areas
US7089367B1 (en) * 1999-08-11 2006-08-08 Intel Corporation Reducing memory access latencies from a bus using pre-fetching and caching
US6611827B1 (en) 1999-10-01 2003-08-26 International Business Machines Corporation Redundant disk array and method for redundant disk array access using contiguous page grouping
US6546499B1 (en) 1999-10-14 2003-04-08 International Business Machines Corporation Redundant array of inexpensive platters (RAIP)
US6904599B1 (en) * 1999-11-29 2005-06-07 Microsoft Corporation Storage management system having abstracted volume providers
US6553387B1 (en) * 1999-11-29 2003-04-22 Microsoft Corporation Logical volume configuration data management determines whether to expose the logical volume on-line, off-line request based on comparison of volume epoch numbers on each extents of the volume identifiers
US6684231B1 (en) * 1999-11-29 2004-01-27 Microsoft Corporation Migration of friendly volumes
US6370616B1 (en) 2000-04-04 2002-04-09 Compaq Computer Corporation Memory interface controller for datum raid operations with a datum multiplier
US6675253B1 (en) 2000-04-04 2004-01-06 Hewlett-Packard Development Company, L.P. Dynamic routing of data across multiple data paths from a source controller to a destination controller
EP1415425B1 (en) 2001-07-06 2019-06-26 CA, Inc. Systems and methods of information backup
US6658528B2 (en) * 2001-07-30 2003-12-02 International Business Machines Corporation System and method for improving file system transfer through the use of an intelligent geometry engine
US7146448B2 (en) * 2001-09-28 2006-12-05 Dot Hill Systems Corporation Apparatus and method for adopting an orphan I/O port in a redundant storage controller
US7315911B2 (en) * 2005-01-20 2008-01-01 Dot Hill Systems Corporation Method for efficient inter-processor communication in an active-active RAID system using PCI-express links
US7536495B2 (en) * 2001-09-28 2009-05-19 Dot Hill Systems Corporation Certified memory-to-memory data transfer between active-active raid controllers
US7340555B2 (en) 2001-09-28 2008-03-04 Dot Hill Systems Corporation RAID system for performing efficient mirrored posted-write operations
US7111228B1 (en) 2002-05-07 2006-09-19 Marvell International Ltd. System and method for performing parity checks in disk storage system
US6904498B2 (en) * 2002-10-08 2005-06-07 Netcell Corp. Raid controller disk write mask
US7287102B1 (en) 2003-01-31 2007-10-23 Marvell International Ltd. System and method for concatenating data
US7007114B1 (en) * 2003-01-31 2006-02-28 Qlogic Corporation System and method for padding data blocks and/or removing padding from data blocks in storage controllers
US7099963B2 (en) * 2003-03-10 2006-08-29 Qlogic Corporation Method and system for monitoring embedded disk controller components
US7870346B2 (en) * 2003-03-10 2011-01-11 Marvell International Ltd. Servo controller interface module for embedded disk controllers
US7064915B1 (en) 2003-03-10 2006-06-20 Marvell International Ltd. Method and system for collecting servo field data from programmable devices in embedded disk controllers
US7219182B2 (en) * 2003-03-10 2007-05-15 Marvell International Ltd. Method and system for using an external bus controller in embedded disk controllers
US7039771B1 (en) 2003-03-10 2006-05-02 Marvell International Ltd. Method and system for supporting multiple external serial port devices using a serial port controller in embedded disk controllers
US7492545B1 (en) 2003-03-10 2009-02-17 Marvell International Ltd. Method and system for automatic time base adjustment for disk drive servo controllers
US7240201B2 (en) * 2003-08-01 2007-07-03 Hewlett-Packard Development Company, L.P. Method and apparatus to provide secure communication between systems
US7228432B2 (en) * 2003-09-11 2007-06-05 Angelo Michael F Method and apparatus for providing security for a computer system
US7526691B1 (en) 2003-10-15 2009-04-28 Marvell International Ltd. System and method for using TAP controllers
US7382880B2 (en) 2004-01-26 2008-06-03 Hewlett-Packard Development Company, L.P. Method and apparatus for initializing multiple security modules
US7930503B2 (en) * 2004-01-26 2011-04-19 Hewlett-Packard Development Company, L.P. Method and apparatus for operating multiple security modules
US7139150B2 (en) * 2004-02-10 2006-11-21 Marvell International Ltd. Method and system for head position control in embedded disk drive controllers
US7120084B2 (en) 2004-06-14 2006-10-10 Marvell International Ltd. Integrated memory controller
US8166217B2 (en) * 2004-06-28 2012-04-24 Marvell International Ltd. System and method for reading and writing data using storage controllers
US9201599B2 (en) * 2004-07-19 2015-12-01 Marvell International Ltd. System and method for transmitting data in storage controllers
US7757009B2 (en) 2004-07-19 2010-07-13 Marvell International Ltd. Storage controllers with dynamic WWN storage modules and methods for managing data and connections between a host and a storage device
US8032674B2 (en) * 2004-07-19 2011-10-04 Marvell International Ltd. System and method for controlling buffer memory overflow and underflow conditions in storage controllers
WO2006031551A2 (en) * 2004-09-10 2006-03-23 Cavium Networks Selective replication of data structure
US7941585B2 (en) * 2004-09-10 2011-05-10 Cavium Networks, Inc. Local scratchpad and data caching system
US7594081B2 (en) * 2004-09-10 2009-09-22 Cavium Networks, Inc. Direct access to low-latency memory
US7386661B2 (en) 2004-10-13 2008-06-10 Marvell International Ltd. Power save module for storage controllers
US7240267B2 (en) * 2004-11-08 2007-07-03 Marvell International Ltd. System and method for conducting BIST operations
US7802026B2 (en) * 2004-11-15 2010-09-21 Marvell International Ltd. Method and system for processing frames in storage controllers
US7822715B2 (en) * 2004-11-16 2010-10-26 Petruzzo Stephen E Data mirroring method
US7627776B2 (en) * 2004-11-16 2009-12-01 Petruzzo Stephen E Data backup method
US20060136664A1 (en) * 2004-12-16 2006-06-22 Trika Sanjeev N Method, apparatus and system for disk caching in a dual boot environment
US7543096B2 (en) * 2005-01-20 2009-06-02 Dot Hill Systems Corporation Safe message transfers on PCI-Express link from RAID controller to receiver-programmable window of partner RAID controller CPU memory
US20060184730A1 (en) * 2005-02-11 2006-08-17 Brown Joanna K Drive based sector initialization
JP4440803B2 (en) * 2005-03-03 2010-03-24 富士通株式会社 Storage device, control method thereof, and program
US7609468B2 (en) 2005-04-06 2009-10-27 Marvell International Ltd. Method and system for read gate timing control for storage controllers
KR100827677B1 (en) * 2006-06-20 2008-05-07 한국과학기술원 A method for improving I/O performance of RAID system using a matrix stripe cache
US7536508B2 (en) * 2006-06-30 2009-05-19 Dot Hill Systems Corporation System and method for sharing SATA drives in active-active RAID controller system
US20080040553A1 (en) * 2006-08-11 2008-02-14 Ash Kevin J Method and system for grouping tracks for destaging on raid arrays
US8627002B2 (en) 2006-10-12 2014-01-07 International Business Machines Corporation Method to increase performance of non-contiguously written sectors
US7681089B2 (en) * 2007-02-20 2010-03-16 Dot Hill Systems Corporation Redundant storage controller system with enhanced failure analysis capability
GB0915598D0 (en) * 2009-09-07 2009-10-07 St Microelectronics Res & Dev Error detection
US8458515B1 (en) * 2009-11-16 2013-06-04 Symantec Corporation Raid5 recovery in a high availability object based file system
US20110258380A1 (en) * 2010-04-19 2011-10-20 Seagate Technology Llc Fault tolerant storage conserving memory writes to host writes
US8601313B1 (en) 2010-12-13 2013-12-03 Western Digital Technologies, Inc. System and method for a data reliability scheme in a solid state memory
US8601311B2 (en) 2010-12-14 2013-12-03 Western Digital Technologies, Inc. System and method for using over-provisioned data capacity to maintain a data redundancy scheme in a solid state memory
US8615681B2 (en) 2010-12-14 2013-12-24 Western Digital Technologies, Inc. System and method for maintaining a data redundancy scheme in a solid state memory in the event of a power loss
US8700950B1 (en) 2011-02-11 2014-04-15 Western Digital Technologies, Inc. System and method for data error recovery in a solid state subsystem
US8700951B1 (en) * 2011-03-09 2014-04-15 Western Digital Technologies, Inc. System and method for improving a data redundancy scheme in a solid state subsystem with additional metadata
US9606929B2 (en) * 2011-11-08 2017-03-28 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Simulated NVRAM
AU2013380500B2 (en) * 2013-03-04 2016-06-23 Kabushiki Kaisha Toshiba Database device, program, and data processing method
US10235288B2 (en) * 2015-10-02 2019-03-19 Netapp, Inc. Cache flushing and interrupted write handling in storage systems
US10019383B2 (en) * 2016-11-30 2018-07-10 Salesforce.Com, Inc. Rotatable-key encrypted volumes in a multi-tier disk partition system
US10642797B2 (en) 2017-07-28 2020-05-05 Chicago Mercantile Exchange Inc. Concurrent write operations for use with multi-threaded file logging
US10956245B1 (en) * 2017-07-28 2021-03-23 EMC IP Holding Company LLC Storage system with host-directed error scanning of solid-state storage devices
CN110515718B (en) * 2019-08-30 2023-07-18 深圳前海微众银行股份有限公司 Batch task breakpoint continuous method, device, equipment and medium
CN115857792A (en) * 2021-09-23 2023-03-28 华为技术有限公司 Data processing method and related equipment

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4467421A (en) * 1979-10-18 1984-08-21 Storage Technology Corporation Virtual storage system and method
US4423479A (en) * 1980-11-14 1983-12-27 Sperry Corporation Cache/disk subsystem with acquire write command
ES507900A0 (en) * 1981-12-11 1983-01-16 Pichel Moure Carlos  OSTEOSCOPE TO PERFORM THE ENCLOSURE OF FRACTURES IN CLOSED SKY, THROUGH ENDOSCOPE CONTROL.
US4636946A (en) * 1982-02-24 1987-01-13 International Business Machines Corporation Method and apparatus for grouping asynchronous recording operations
US4775978A (en) * 1987-01-12 1988-10-04 Magnetic Peripherals Inc. Data error correction system
US4993030A (en) * 1988-04-22 1991-02-12 Amdahl Corporation File system for a plurality of storage classes
US5253351A (en) * 1988-08-11 1993-10-12 Hitachi, Ltd. Memory controller with a cache memory and control method of cache memory including steps of determining memory access threshold values
US5065354A (en) * 1988-09-16 1991-11-12 Compaq Computer Corporation Queued posted-write disk write method with improved error handling
US5014237A (en) * 1988-10-31 1991-05-07 Tandon Corporation Disk drive controller system with enhanced communications interface
AU630635B2 (en) * 1988-11-14 1992-11-05 Emc Corporation Arrayed disk drive system and method
US5206943A (en) * 1989-11-03 1993-04-27 Compaq Computer Corporation Disk array controller with parity capabilities
US5195100A (en) * 1990-03-02 1993-03-16 Micro Technology, Inc. Non-volatile memory storage of write operation identifier in data sotrage device
US5088081A (en) * 1990-03-28 1992-02-11 Prime Computer, Inc. Method and apparatus for improved disk access
US5124987A (en) * 1990-04-16 1992-06-23 Storage Technology Corporation Logical track write scheduling system for a parallel disk drive array data storage subsystem
US5155835A (en) * 1990-11-19 1992-10-13 Storage Technology Corporation Multilevel, hierarchical, dynamically mapped data storage subsystem
US5297258A (en) * 1991-11-21 1994-03-22 Ast Research, Inc. Data logging for hard disk data storage systems
US5333305A (en) * 1991-12-27 1994-07-26 Compaq Computer Corporation Method for improving partial stripe write performance in disk array subsystems

Also Published As

Publication number Publication date
EP0573308A2 (en) 1993-12-08
EP0573308A3 (en) 1994-02-16
US5408644A (en) 1995-04-18

Similar Documents

Publication Publication Date Title
CA2097782A1 (en) Posted write disk array system
US5778426A (en) Methods and structure to maintain a two level cache in a RAID controller and thereby selecting a preferred posting method
AU671543B2 (en) Processor interface chip for dual-microprocessor processor system
US6760814B2 (en) Methods and apparatus for loading CRC values into a CRC cache in a storage controller
US5809560A (en) Adaptive read-ahead disk cache
US6516380B2 (en) System and method for a log-based non-volatile write cache in a storage controller
EP2329361B1 (en) Aggregation of write traffic to a data store
US4468730A (en) Detection of sequential data stream for improvements in cache data storage
US6647514B1 (en) Host I/O performance and availability of a storage array during rebuild by prioritizing I/O request
US6148368A (en) Method for accelerating disk array write operations using segmented cache memory and data logging
EP0559142B1 (en) Data storage format conversion method and system, data access method and access control apparatus
US5884098A (en) RAID controller system utilizing front end and back end caching systems including communication path connecting two caching systems and synchronizing allocation of blocks in caching systems
US5166936A (en) Automatic hard disk bad sector remapping
EP0205965A2 (en) Peripheral subsystem having read/write cache with record access
US8250283B1 (en) Write-distribute command for RAID mirroring
EP0207288A2 (en) Peripheral subsystem initialization method and apparatus
US6298415B1 (en) Method and system for minimizing writes and reducing parity updates in a raid system
US20040093463A1 (en) RAID-5 disk having cache memory implemented using non-volatile RAM
US6055604A (en) Forced transaction log posting using a least busy storage media without maintaining redundancy of the transaction log
US6553509B1 (en) Log record parsing for a distributed log on a disk array data storage system
JP2004213647A (en) Writing cache of log structure for data storage device and system
GB2357172A (en) Raid storage system with logicaly partitioned cache
AU3786993A (en) High-performance non-volatile ram protected write cache accelerator system
EP1582971A2 (en) Diskarray system
EP1700199B1 (en) Method, system, and program for managing parity raid data updates

Legal Events

Date Code Title Description
FZDE Discontinued