CA1270333A - Parity spreading to enhance storage access - Google Patents
Parity spreading to enhance storage access
- Publication number
- CA1270333A (application CA000535598A)
- Authority
- CA
- Canada
- Prior art keywords
- parity
- data
- record
- blocks
- records
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/18—Error detection or correction; Testing, e.g. of drop-outs
- G11B20/1833—Error detection or correction; Testing, e.g. of drop-outs by adding special lists or symbols to the coded information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2211/00—Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
- G06F2211/10—Indexing scheme relating to G06F11/10
- G06F2211/1002—Indexing scheme relating to G06F11/1076
- G06F2211/104—Metadata, i.e. metadata associated with RAID systems with parity
Abstract
ABSTRACT OF THE DISCLOSURE
A storage management mechanism distributes parity blocks corresponding to multiple data blocks substantially equally among a set of storage devices. N storage units in a set are divided into a multiple of equally sized address blocks, each containing a plurality of records. Blocks from each storage unit having the same address ranges form a stripe of blocks. Each stripe has a block on one storage device containing parity for the remaining blocks of the stripe. Further stripes also have parity blocks, which are distributed on different storage units. Parity updating activity associated with every change to a data record is therefore distributed over the different storage units, enhancing access characteristics of the set of storage devices. The parity updating activity also includes the use of an independent version number stored with each data record and corresponding version numbers stored with the parity record. Each time a data record is changed, its version number is incremented and the corresponding version number in the parity record is incremented with the parity record update.
Description
PARITY SPREADING TO ENHANCE STORAGE ACCESS
Background of the Invention

The present invention relates to maintaining parity information on multiple blocks of data and in particular to the storage of such parity information.
U.S. Patent No. 4,092,732 to Ouchi describes a check sum generator for generating a check sum segment from segments of a system record as the system record segments are being transferred between a storage subsystem and a central processing unit. The check sum segment is actually a series of parity bits generated from bits in the same location of the system record segments. In other words, each bit, such as the first bit of the check sum segment, is the parity of the group of first bits of the record segments. When a storage unit containing a record segment fails, the record segment is regenerated from the check sum segment and the remaining system segments. One storage unit is selected for containing all the check sum segments for a plurality of record storage units.
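For concreteness, here is a minimal sketch of such a check sum segment, assuming record segments are equal-length byte strings; the function names are illustrative, not taken from the Ouchi patent.

```python
def checksum_segment(segments):
    """XOR all record segments together; each bit of the result is the
    parity of the bits in that position across the segments."""
    result = bytearray(len(segments[0]))
    for seg in segments:
        for i, byte in enumerate(seg):
            result[i] ^= byte
    return bytes(result)

def regenerate_segment(checksum, surviving_segments):
    """Rebuild a lost segment from the check sum and the survivors."""
    return checksum_segment([checksum] + list(surviving_segments))

segments = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = checksum_segment(segments)
assert regenerate_segment(parity, segments[1:]) == segments[0]
```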
In the Ouchi patent, the check sum segment is always generated from reading all the record segments it covers. If one record segment is changed, all the record segments covered are read and the checksum segment is generated. An IBM Technical Disclosure Bulletin, Vol. 24, No. 2, July 1981, pages 986-987, Efficient Mass Storage Parity Recovery Mechanism, improves upon the generation of the checksum segment, or parity segment, by copying a record segment before it is changed. The copy of the record segment is then exclusive-ORed with the changed record segment to create a change mask. The parity segment is then read and exclusive-ORed with the change mask to generate the new parity segment, which is then written back out to the storage unit.
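The change-mask technique can be sketched as follows; this is an illustrative model assuming equal-length byte-string records, not the TDB's implementation. Only the old copy, the changed record, and the parity segment are touched, no matter how many other segments the parity covers.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def updated_parity(old_record: bytes, changed_record: bytes, parity: bytes) -> bytes:
    change_mask = xor_bytes(old_record, changed_record)  # 1-bits mark changed bits
    return xor_bytes(parity, change_mask)                # fold the change into parity

records = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_bytes(xor_bytes(records[0], records[1]), records[2])
changed = b"\xff\x00"
parity = updated_parity(records[0], changed, parity)
records[0] = changed
# The patched parity still equals the XOR of all current records.
assert xor_bytes(xor_bytes(records[0], records[1]), records[2]) == parity
```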
While a number of reads on record segments that are not changed is avoided in the prior art, a single storage unit is used to store parity segments for multiple record segments on multiple storage devices. A read and a write on the single storage unit occurs each time a record is changed on any of the storage units covered by the parity record on the single storage unit. Thus, the single storage unit becomes a bottleneck to storage operations, since the number of changes to records which can be made per unit of time is a function of the access rate of the single storage unit as opposed to the faster access rate provided by parallel operation of the multiple storage units.
Recovery of a lost record depends on the synchronization of the parity record with each of the data records that it covers. Without special hardware, such as non-volatile storage, and/or additional write operations to storage units, it is difficult to guarantee that both the data records and parity record are updated to a consistent state if the system terminates abnormally. Since two I/O operations are required to update the data and its associated parity, it is difficult to determine which I/O operation has completed following the system termination.
Summary of the Invention

A storage management mechanism distributes parity information substantially equally among a set of storage units. N storage units in a set are divided into a plurality of equally sized address areas referred to as blocks. Each storage unit contains the same number of blocks. Blocks from each storage unit in a set having the same unit address ranges are referred to as stripes. Each stripe has N-1 blocks of data and a parity block on one storage device containing parity for the remainder of the stripe. Further stripes each have a parity block, the parity blocks being distributed on different storage units.
Parity updating activity associated with every modification of data in a set is therefore distributed over the different storage units. No single unit is burdened with all the parity update activity.
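A hypothetical sketch of this distribution: if stripes are numbered from zero and parity rotates round robin (the assignment the detailed description later adopts), the parity-holding unit for a stripe is a simple modulus.

```python
def parity_unit_for_stripe(stripe: int, n_units: int) -> int:
    """Unit index (0-based) holding the parity block of the given stripe."""
    return stripe % n_units

n_units = 5
for stripe in range(6):
    p = parity_unit_for_stripe(stripe, n_units)
    data = [u for u in range(n_units) if u != p]
    print(f"stripe {stripe}: parity on unit {p}, data on units {data}")
```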
In the preferred embodiment, each storage unit participating in a set is the same size. This permits a simplified definition of the stripes. Since each storage unit is the same size, no storage unit has areas left over which need to be handled separately.
The number of storage units in a set is preferably more than two. With just two units, the protection is similar to mirroring, which involves maintaining two exact copies of data. With more than two units, the percent of storage dedicated to protection decreases. With three units, the percentage of storage needed to obtain the desired protection is about 33 percent. With eight units, slightly more than 12.5 percent of storage capacity is used for protection.
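These percentages are just the 1/N overhead of one parity block per stripe, as this small check shows (the exact figures in the text may include header overhead this ratio ignores):

```python
for n in (2, 3, 8):
    print(f"{n} units per set: {100 / n:.1f}% of capacity holds parity")
# 2 units: 50.0%   3 units: 33.3%   8 units: 12.5%
```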
In a further preferred embodiment, N-1 data records (520 bytes) and 1 parity record in a set having the same address range are referred to as a slice. Each data record in the slice has a version indicator which indicates the version of the record. A header in each parity record comprises a plurality of version indicators, one corresponding to each of the data records in the slice. Each time a data record is updated, its version indicator is incremented and the parity record version indicator corresponding to that data record is also incremented. When both the record update and the parity update are complete, the version indicators are equal. During recovery of a lost record, the version numbers are checked to ensure synchronization of the records with the parity. Forcing recovery without valid synchronization would produce unpredictable data.
Because each data record has a version number that is independent of every other data record, no serialization is required for updates to different data records covered by the same parity record. Also avoided is the need to read the parity record from storage before scheduling the data record write operation and queueing a parity update request.
In a further preferred embodiment, an unprotected stripe is provided for data records which need not be covered by parity groups. The unprotected stripes need not be the same size as the protected stripes, and may include variable size areas on the storage units if the storage units are not identical in size. Such striping provides a convenient method of segregating the storage units into areas of protected and unprotected storage because the same address area of each storage device is subject to protection. Performance benefits are realized if it is unnecessary to protect all the records stored on the units, because there is no parity update required after a record in an unprotected stripe is changed.
Brief Description of the Drawings
Fig. 1 is a block diagram of a system incorporating the parity block protection distribution of the present invention;
Fig. 2 is a block diagram of the distribution of the parity blocks of Fig. 1 over a plurality of storage devices;
Fig. 3 is a block diagram representation of logical tables used to correlate parity groups and data records;
Fig. 4 is a block diagram of records in a stripe of storage utilizing version indications for synchronization;
Fig. 5 is a flow diagram of the initialization of storage devices for parity block protection; and
Fig. 6 is a flow diagram of steps involved in updating data records and their corresponding parity records.
Detailed Description of the Preferred Embodiment

A computer system implementing block parity spreading is indicated generally at 10 in Fig. 1. System 10 comprises a data processing unit 12 coupled to a control store 14 which provides fast access to microinstructions. Processor 12 communicates via a channel adapter 16 and through a high-speed channel 18 to a plurality of I/O units. Processor 12 and the I/O units have access to a main storage array 20. Access to main storage 20 is provided by a virtual address translator 22. Address translation tables in main storage 20, and a translation lookaside buffer, provide mapping from virtual to real main storage addresses.
Each I/O device, such as disk drives 30, 32, 34, 36, and 38, is coupled through a controller, such as a disk storage controller 40 for the above disk drive storage devices. I/O controller 42 controls tape devices 44 and 46. Further I/O controllers 48, 50 and 52 control I/O devices such as printers, workstations, keyboards, displays, and communications. There are usually multiple disk storage controllers, each controlling multiple disk drive storage devices.
Data in system 10 is handled in the form of records comprising 512-byte pages of data and 8-byte headers. In the preferred embodiment, an IBM System/38, records are moved into and out of main storage 20 from disk storage via the channel 18. A main storage controller 60 controls accessing and paging of main storage 20. A broken line 62 between channel 18 and main storage controller 60 indicates direct memory access to main storage 20 by the I/O devices coupled to the channel. Further detail on the general operation of system 10 is found in a book, IBM System/38 Technical Developments, International Business Machines Corporation, 1978.
Protection of data on the disk storage devices 30 through 38 is provided by exclusive ORing data records on each device, and storing the parity record resulting from the exclusive OR on one of the storage devices. In Fig. 2, each storage device 30 through 38 is divided into blocks of data and blocks of parity. The blocks represent physical space on the storage devices. Since the system 10 provides an extent (a contiguous piece of allocatable disk space) of up to 16 megabytes of data, each block is preferably 16 megabytes.
Blocks 70, 72, 74, 76 and 78, one on each storage device, preferably having the same physical address range, are referred to as a stripe. There are 9 stripes shown in Fig. 2. Each protected stripe has an associated parity block which contains the exclusive OR of the other blocks in the stripe. In the first stripe, block 70 contains the parity for the remaining blocks 72, 74, 76 and 78. A block 80 on storage device 32 contains the parity for the remaining blocks on the second stripe. Block 82, on storage device 34, contains the parity for the third stripe. Blocks 84 and 86 contain the parity for the fourth and fifth stripes respectively. The parity blocks, including blocks 88, 90, and 92 for stripes 6, 7 and 8, are spread out, or distributed, over the storage devices. The 9th stripe is an unprotected area which does not have a parity block associated with it. It is used to store data which does not need protection from loss.
Spreading of the parity information ensures that one particular storage device is not accessed much more than the other storage devices during writing of the parity records following updates to data records on the different stripes. A change to records stored in a block will result in a change also having to be made to the parity block for the stripe including the changed records. Since the parity blocks for the stripes are spread over more than one storage device, the parity updates will not be concentrated at one device. Thus, I/O activity is spread more evenly over all the storage devices.
In Fig. 3, a unit table 310 contains information for each storage unit which participates in the parity protection. A physical address comprising a unit number and a sector, or page number, is used to identify the location of the desired data. The unit table is then used to identify parity control blocks indicated at 314, 316, and 318. Units 1-8 are members of the first parity set associated with control block 314. Units 9-13 are members of the second parity set associated with control block 316, and units N-2 to N are members of the Ith parity set associated with control block 318.
Each entry in the unit table points to the control block associated with the set of storage devices of which the entry is a member. The control blocks identify which unit of the set contains the parity block for each stripe. In control block 314, the stripe comprising the first 16 megabytes of storage has its parity information stored in unit number 1, in unit number 1's first 16 megabytes. The second 16 megabytes of the stripe comprising units 1-8 is contained in the second 16 megabytes of unit number 2. The parity block allocation continues in a round robin manner with units 3-8 having parity for the next 6 stripes respectively. The ninth stripe in the first parity set has its parity stored in the ninth block of 16 megabytes on unit number 1. The last stripe, allocated to unit J, one of the eight units in the set, may not contain a full 16 megabytes depending on whether the addressable storage of the units is divisible by 16 megabytes. A header in each of the parity control blocks describes which units are in the set, and also identifies an address range common for each of the units which is not protected by a parity group. Having a common range for each unit which is not protected simplifies the parity protection scheme. Since the same physical addresses on each storage device are exclusive ORed to determine the parity information, no special tables are required to correlate the information to the parity other than those shown in Fig. 3. The common unprotected address range requires no special consideration, since the identification of the range is in the control block and is common for each unit.
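A toy model of this lookup path, assuming 16-megabyte blocks and the round-robin stripe array just described; the class and field names are invented for illustration.

```python
BLOCK_BYTES = 16 * 2**20  # one 16-megabyte block per unit per stripe

class ParityControlBlock:
    def __init__(self, member_units):
        self.member_units = member_units  # unit numbers, in control-block order

    def parity_unit(self, stripe: int) -> int:
        # round robin: stripe 0 -> first member, stripe 1 -> second member, ...
        return self.member_units[stripe % len(self.member_units)]

set_one = ParityControlBlock(list(range(1, 9)))        # units 1-8
unit_table = {u: set_one for u in set_one.member_units}

def parity_location(unit: int, byte_address: int):
    """Map a data record's physical address to the unit holding its parity."""
    control = unit_table[unit]
    stripe = byte_address // BLOCK_BYTES
    return control.parity_unit(stripe), byte_address   # same address range

# The third stripe (addresses 32-48 MB) keeps its parity on unit 3.
print(parity_location(7, 40 * 2**20))  # -> (3, 41943040)
```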
Parity control block 316 corresponds to the set of storage units 9-13 in Fig. 3. These five units may be thought of as storage units 30-38 in Fig. 2. The allocation of parity groups to storage units is on a round robin basis. Each consecutive 16 megabytes of storage has its parity group stored on consecutive storage devices starting with device 30, or unit number 9 in Fig. 3. Unit number 9 also contains the parity group for the 80-96 megabyte range of the stripe for storage units 9-13 (30-38). The last stripe in the set has its parity stored on the Kth unit, where K is the unit where the allocation of parity blocks ends because there are no more protected stripes.
Control block 318 corresponds to the set of storage units N-2 through N in the Ith set of storage units. The last unit allocated a parity block is labeled L, and is one of the three units in the set depending on the number of stripes in the set. The Ith set contains the minimum number of storage units, three, considered desirable for implementation of the parity protection. The use of two units would be possible, but would be similar to mirroring, with the extra step of an exclusive OR. Eight units in a set has been selected as the maximum in the preferred embodiment due to a system specific constraint to be discussed below. More than eight units may be used in a set without loss of protection.
With a very large number of units in a set, reconstruction of the data lost when a single unit fails would take a longer time because each unit would have to be read. There is also an increased chance of loss of more than one unit at a time. If this occurs, it is not possible to reconstruct the data from either of the lost units using the simple parity discussed above. The invention is considered broad enough to cover a more complex data protection code, which may be stored similarly to the parity, and permit multiple bit correction in the event more than one storage device fails. A set could also be arranged multidimensionally as described in the IBM TDB, Vol. 24, No. 2, pages 986-987, to permit reconstruction of data from at least two failed units. Further embodiments spread the parity information based on frequency of updating data to spread the I/O activity evenly, as opposed to spreading the parity itself evenly.
Each data record contains a version number. Since updates to multiple data records covered by one parity record may occur, each record in a parity block also contains a corresponding version indication for each record in the slice it covers. A slice is a set of data records and their corresponding parity record. The version indications are not covered by the parity protection scheme. In Fig. 4, four data records 410, 412, 414 and 416 each contain a header with a record version number indicated at 418, 420, 422 and 424 respectively. A parity record 426 contains a header with four version numbers, 428, 430, 432 and 434, corresponding to the version numbers in the data records. The version numbers or indications may be any length compatible with the number of bits available in the record headers. A one bit length was chosen due to the unavailability of further bits. Each time a data record is changed, its version number is incremented. The corresponding version number in the parity record is also incremented so that they have the same value.
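An illustrative in-memory model of such a slice, assuming one-bit version numbers as in the preferred embodiment; the class layout is invented, and the parity page update itself is sketched later.

```python
from dataclasses import dataclass, field

@dataclass
class DataRecord:
    page: bytes
    version: int = 0                     # single version bit in the header

@dataclass
class ParityRecord:
    page: bytes
    versions: list[int] = field(default_factory=lambda: [0] * 4)
    # one version bit per data record of the slice, kept in the parity header

def bump_versions(record: DataRecord, parity: ParityRecord, idx: int) -> None:
    record.version ^= 1                  # increment modulo 2
    parity.versions[idx] ^= 1            # equal again once both updates complete
```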
The version numbers are used to check for synchronization of the parity record with each data record in a slice in the event of lost data. When changes to several data records occur, the data records may be written to disk storage before or after the parity records are updated with the change masks. The change masks are queued for incorporation into the parity records. Associating a version number with each parity and data record adds a constraint that parity record updates for a given data record to disk storage must be processed in the same order as the data record updates. Otherwise, the version numbers may not be accurate. A FIFO queue holds the change masks so that they are incorporated into the parity records on mass storage in the order that the change masks were generated.
Special consideration is given to the version numbers due to the limited availability of bits in the headers. The version number for each data record is stored in the first bit position of the 6th byte of the respective records. The corresponding version numbers in the parity record are contained in the first 4 bit positions of the 6th byte and the first 3 bit positions of the 9th byte of the header. The bit positions for the parity record header version numbers will be referred to as bits 1-7, corresponding to the order described above.
The units in a set are numbered 1-n, corresponding to their order in the parity control block for the set. The version numbers are stored in ascending order, based on the unit number, in the parity record headers, skipping the parity unit. If the third unit is the parity unit, the version numbers corresponding to the first two units are stored in bit positions 1 and 2. The version number corresponding to the fourth unit is stored in bit position 3. The nth unit version number is in position n-1 in the parity record header. Storing version numbers in this manner permits the largest possible set given the storage limitation. With extra storage, the size of the set may be optimized as a function of other system considerations. The positioning of the version numbers in the parity header could also have a straightforward unit number to bit position correspondence.
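The packing rule reduces to a one-line function; a sketch under the stated numbering (units 1-n, bit positions 1-7, parity unit skipped):

```python
def version_bit_position(unit: int, parity_unit: int) -> int:
    """1-based bit position of a unit's version number in the parity header."""
    if unit == parity_unit:
        raise ValueError("the parity unit carries no version bit of its own")
    return unit if unit < parity_unit else unit - 1

# With the third unit holding parity: units 1,2 -> bits 1,2; unit 4 -> bit 3;
# and in general the nth unit lands in position n-1.
assert [version_bit_position(u, 3) for u in (1, 2, 4, 8)] == [1, 2, 3, 7]
```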
Because each data record has a version number that is independent of every other data record, no serialization is required for updates to different data records covered by the same parity record. Transfer of records into and out of main storage may be based on other system throughput considerations to improve processing speed. The separate version numbers also eliminate the need to read the parity records from mass storage before scheduling a data record write operation and queueing a parity update request.
In order to limit the amount of storage occupied by a version number, it is allowed to wrap from the largest to the smallest value without error. This enables the version number to be reused. If a one bit version number is used, it wraps from 1 to 0. Thus, there are two values it assumes. A version number with a higher number of bits allows more values. Since data record and parity record updates are asynchronous, an update to a data record is held if the updated version number could be confused with an existing version number.
Such confusion could exist if the same version number as the version number associated with the update could exist on the data record or the parity record on mass storage. The time frame that must be considered is from the time a new parity update request is placed on the FIFO queue to the time the request is completed. Both the data record and the parity record must be updated before an update request is considered complete.
The version numbers that could exist on disk before a new update request is completed include all the values for prior update requests that are still on the queue, plus the value that precedes the first (oldest) request element in the queue. If an update request is not removed from the queue until it is completed, it is only necessary for a new update to wait if there are other requests on the queue for the same data record, and the incremented version number for the new data record matches the version number preceding the first request element still in the queue.
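A sketch of that wait test, assuming each queued request carries a record identifier and the version number it will establish; the queue layout is illustrative. With one-bit versions it degenerates to "wait whenever any request for the same record is pending", which is exactly the simplification described below.

```python
from collections import deque

def must_wait(queue: deque, record_id, new_version: int, version_bits: int = 1) -> bool:
    """True if scheduling the update now could confuse version numbers."""
    pending = [req for req in queue if req["record"] == record_id]
    if not pending:
        return False
    oldest = pending[0]["version"]
    preceding = (oldest - 1) % (2 ** version_bits)  # value before the oldest request
    return new_version == preceding

queue = deque([{"record": "r1", "version": 1}])
assert must_wait(queue, "r1", new_version=0)        # 1-bit versions always collide
assert not must_wait(queue, "r2", new_version=0)    # other records never wait
```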
The performance cost of searching the queue before scheduling an update is not severe as long as the number of update requests in the queue remains small. A reasonably fast access time to the storage used for the queue also reduces the performance cost. Mainstore provides a suitable storage area for the queue. Keeping the number of update requests in the queue small is also important to ensure that no request has an excessive wait time.
When, as in the preferred embodiment, the version numbers are implemented as single bit flags, it is only necessary to search the queue for any prior update request to the same data record in order to determine whether a new update must wait. There is no need to check version number values, because a new request must always wait if there is an incomplete request in the queue for the same data block.
In the preferred embodiment, the record headers contain seven unused bits available for implementation of the present invention. This places a limit on the number of units which may participate in a set. Only seven version numbers may be kept in a parity record, so only up to eight total units participate in a parity protection set. As mentioned before, many more units, or as few as three units, could efficiently make use of the present invention.
CONFIGURATION OF SYSTEM FOR PARITY PROTECTION
Configuration of the system for parity protection of data is initiated by a user at 510 in the flow diagram of Fig. 5. A set-up task on processor 12 builds the parity control blocks 314-318 in block 512 of the flow diagram. The task uses information in a configuration unit table 312 which identifies the storage units coupled to the system. A storage device, such as the IBM 3370, may have more than one independent unit in it, such as an independently accessible arm, but only one unit from a storage device is chosen for any particular set. Another criterion for unit selection involves using as many disk controllers as possible in a set. These criteria are used so that a failure mode will not affect two units in a set. The set-up task also maximizes usable capacity by maximizing the number of units in a set.
After the control blocks have been built, the stripe arrays in the control blocks are built at 514. The stripe arrays, as previously mentioned, assign parity blocks on a round robin basis to member units in each set. The stripe arrays are formed to indicate which units contain the parity blocks for successive stripes comprising 16 megabyte blocks from each unit. A user may also define a size of unprotected storage desired. A definition of the address range of the unprotected stripe is entered in the header of the control block. The setup task will not define a block from the unprotected stripe as a parity block, so the entire unprotected stripe will be available for data records.
Next, the set-up task writes the control blocks at 516 to more than one member unit of each set to which the control blocks correspond. The control blocks are used during recovery to identify the members of the sets. Thus, they must be available without parity recovery. They are written to more than one member unit of each set in case one of the member units containing it fails. In the event that the protection scheme protects against failure of more than one unit, the control blocks are written to at least one more unit than the number of units which may be recovered. In the preferred embodiment, they are written to each member of the set so that the units need not be searched to determine which unit contains the control block. Now that the sets are identified and the control blocks built, block 518 of the set-up task validates the parity blocks, including the version numbers. In the preferred embodiment, this is done by zeroing all the data on all the units. Since even parity is used for the parity protection, the result is a valid parity for all the data. The version numbers are also zero to start with. The system is then initialized in a standard manner at 520 by causing an initial program load.
It is also possible to add a member to a set which does not yet have the maximum number of units. The new member is preferably zeroed, and added to the unit table. The parity blocks of the set are then redistributed to include the new unit. The control block for the set is then revised, and the units which contained parity blocks that were transferred to the new unit have their address ranges which contained the parity blocks zeroed, validating the parity group for that stripe.
In the preferred embodiment, the control block is temporarily revised to indicate that a change in the set will occur due to the addition of the new unit. The temporary change is done in case a failure occurs during redistribution of the parity blocks. The parity blocks of an existing set are then redistributed to include the new unit. The units which contained parity blocks that were transferred to the new unit have their address ranges which contained the parity blocks zeroed. When redistribution is complete, the changes to the control block are made permanent.
It is also possible to add a new unit without transferring parity blocks. Not transferring the parity blocks would fail to make use of the increase in access rate of the set made possible by adding the new unit. The new unit would, however, be protected by the existing parity blocks.
UPDATING RECORD AND PARITY IN A SLICE
In Fig. 6, changing a data record and its corresponding parity record is shown. Such a change may be called for by a person changing data in a database, or by a machine process requesting a change for a number of reasons. At 610, a user task operating on processor 12 reads the data record to be changed. The user task first makes an extra copy of the data record at 612 and then makes the changes to the data record in a conventional manner.
The user task then creates a change mask at 614 by exclusive ORing the changed data record into the extra copy of the data record. The first bit of the changed data is exclusive ORed with the first bit of the extra copy to form the first bit of the change mask. Each consecutive bit of the changed data is similarly exclusive ORed with corresponding bits of the extra copy to form further bits of the change mask. An already existing machine instruction performs the exclusive OR of the headers of the changed record and the copied record. Two further exclusive OR machine instructions perform exclusive ORs of the pages in 256 byte blocks. The positions in the header corresponding to the version numbers in the parity record are then zeroed in the change mask so that they will not affect the version numbers of the parity record.
The version number in the data record header is then incremented by the user task at block 616. The user task at 617 then determines which unit contains the appropriate parity record by searching the unit table 312 based on the unit number from the physical address of the new data. The unit table indicates which control block 314-318 to use to identify the unit containing the parity record for the particular data record. Again, the address of the new data record is used to identify the stripe of interest. Once the unit containing the parity record for the stripe is identified, task 617 places an update request which includes the record address and change mask on a queue 624 for the appropriate unit. The update request also indicates that the data record has not yet been written. Prior to task 617 placing an update request on queue 624, it searches queue 624 to ensure there can be no confusion with version numbers, as discussed previously. If confusion is possible, the user task waits until confusion is not possible before proceeding.
The address of the parity record is known to be the same as the address of the new data except that it is on a different unit. A write request for the new data is issued at 618 to a queue 620. When the data record is written to storage 622, an indication of that fact is sent at 625 to the update request on queue 624. Flow then returns to wait for the next data record change.
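Pulling the user-task side of Fig. 6 together (blocks 610-618), a condensed sketch with in-memory stand-ins for the record store and queues; the helper names are invented and the version handling is reduced to the one-bit case.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def user_task_update(records, addr, new_page, update_queue):
    old_page = records[addr]["page"]                 # 610: read the record
    extra_copy = bytes(old_page)                     # 612: copy before changing
    records[addr]["page"] = new_page
    change_mask = xor_bytes(extra_copy, new_page)    # 614: old XOR new
    # (version bit positions in the mask are assumed already zeroed here)
    records[addr]["version"] ^= 1                    # 616: bump the version bit
    update_queue.append({                            # 617: queue parity update
        "addr": addr, "mask": change_mask,
        "data_written": False, "parity_written": False,
    })
    # 618: a data write request would be issued here; its completion callback
    # sets data_written so the queue entry can eventually be retired.
```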
A parity update task starts at 627 by getting the next parity update request from queue 624. Once the parity record identified in the update request has been read at 626, the change mask is exclusive ORed at 628 into the parity record. The version numbers are not changed by the exclusive OR because the change mask contains zeros in the bit positions corresponding to the version numbers. The exclusive OR is performed using the same exclusive OR machine instruction as is used by block 614 of the user task. The parity record version number corresponding to the data record is then incremented at 630, and a write request for the parity record is issued at 632. The parity update task then gets the next parity update request at 627.
A queue 634 stores parity record write requests for the storage units. In this case, a storage unit 636 is indicated for the particular write request. Storage unit 636 is different from unit 622 because the data and the parity records are not to be written to the same unit. When the write of the parity record completes, the update request on queue 624 is notified at 637. When both the data write and the parity write are completed, the entry is removed from queue 624, indicating that there is no longer the possibility of confusion with the version number.
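The matching parity-task side (blocks 626-637), under the same in-memory assumptions as the user-task sketch above; `update_queue` is a `collections.deque` of the request dictionaries built there, and a request is retired only when both writes have completed, mirroring the text.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def parity_update_step(parity_store, update_queue, slice_index):
    if not update_queue:
        return
    req = update_queue[0]                            # 627: oldest request first
    parity = parity_store[req["addr"]]               # 626: read the parity record
    parity["page"] = xor_bytes(parity["page"], req["mask"])  # 628: fold mask in
    parity["versions"][slice_index(req["addr"])] ^= 1        # 630: bump version
    req["parity_written"] = True                     # 632/637: parity write done
    if req["data_written"]:                          # retire only when both writes
        update_queue.popleft()                       # completed: no confusion left
```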
RECORD RECOVERY
Recovery is performed either when a single record read error is encountered during normal operation of the system or when an entire unit fails.
When a unit fails, the data lost on the unit is reconstructed from the remaining units of the set, as indicated in the control blocks stored on more than one member of the set. The failed unit is replaced or repaired. Data for the new unit is then reconstructed from the remaining members of the set, record by record. A parity record is reconstructed simply by reading the data records in the set and regenerating the parity. The version numbers in the parity record are set equal to the corresponding version numbers in the data records.
When regenerating a data record, a check is first made on each of the data records to determine if their version numbers match the version numbers in the parity record for that slice. If any of the version numbers in the slice do not match, a lost data indication is written in the header of the lost record. If the version numbers match, the records in the slice are then exclusive ORed one by one into the new record. The appropriate version number is then copied from the parity record into the new record header.
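A sketch of that regeneration check, assuming the surviving records and the parity record of a slice are available in memory; returning None stands in for writing the lost-data indication.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild_record(surviving, parity, lost_index):
    """surviving maps slice position -> (page, version bit)."""
    for idx, (_, version) in surviving.items():
        if version != parity["versions"][idx]:
            return None                    # out of sync: mark data lost, don't force
    page = parity["page"]
    for data_page, _ in surviving.values():
        page = xor_bytes(page, data_page)  # XOR the survivors into the parity
    version = parity["versions"][lost_index]   # header version copied from parity
    return page, version
```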
If a read error is encountered on either a data record or a parity record, the contents of the failing record are reconstructed with the same mechanism as described above for recovery of an entire unit. In this case, it is necessary to hold all change activity on the slice containing the failed record while recovery is in progress.
Following reconstruction, normal operation of the system continues, with the only data lost being that for which an update of data or parity was made and a unit failed before the corresponding parity or data record could be written. Thus, the vast majority of data on the failed unit is recovered without the unit redundancy overhead of mirroring. By distributing the parity information over the members of the set, as opposed to tying up one device with the parity information, parallel operation of storage devices is utilized to provide maximum access rates.
While the invention has been described with respect to one or more preferred embodiments, and with respect to one particular system, it will be recognized by those skilled in the art that the invention can take many forms and shapes. The record sizes are in no way limited to those discussed, nor are the storage units limited to disk drive devices. The fact that only identical devices are used in sets is merely a matter of design choice for simplification. Numerous combinations of storage units in sets, and the distribution of the data protection or parity blocks, are within the scope of the invention as described and as claimed below.
R0~-86-014 5 Detai ed D scri~ n_of the Pr ferred Embodimenk A computer system implementiny block parity spreading is indicated generally at 10 in Fig. 1. System 10 comprises a data processing unit 12 coupled to a control store 1~ which provides fast access to microinstructions. Processor 12 communicates via a channel adapter 16 and through a high-speed channel 18 to a plurallty of I/O
units. Processor 12 and the I/O units have access to a main storaye array 20. Access to main storage 20 is provided by a virtual address translator 22. Address translation tables in main storage 20, and a translation lookaside buffer provide mapping from virtual to real main storage addresses.
Each I/O device, such as disk drives 30, 32, 34, 36, and 38 is coupled through a controller, such as a disk s-torage controller ~0 for the above disk clrive storage devices. I/O controller 42 controls tape devices 44 and 46. Further I/O controllers 48, 50 and 52 control I/O
devices such as prin-ters, workstations, keyboards, displays, and communications. There are usually multiple disk storage con-trollers, each controlling multiple disk drive storage devices.
Data in system 10 is handled in the form of records comprising 512 byte pages of data and 8-byte headers. In the preferred embodiment, an IBM System/38, records are moved lnto and out of main storage 20 from disk storaye via the channel 18. A main storage controller 60 controls accessing and paying of main storage 20. A
broken line 62 between channel 18 and maln s-torage controller 60 indicates direct memory access to main storaye 20 by the I/O devices coupled to the channel. Further detail on the general operation of system 10 ;s found in a book, IBM System/38 Technical Developments, International 8usiness Machines Corporation, 1978.
Protection of data on the disk storage devices 30 through 38 is provided hy exclusive ORing data records on each device, and ~76~333 R09-~36-Ol~ 6 storing the parity record resulting from the exclusive OR on one of the storage devices. In Fig. 2, each storage device 30 through 38 is divided in-to blocks of data and blocks of parity. The blocks represent physical space on the storage devices. Since the sys-teM 10 provides an exten-t (a contiguous piece of allocatable disk space) of up to 16 megabytes of data, each block is preferably 16 megabytes.
Bloc~s 70, 72, 74, 76 and 78, one on each storage device, preferably having the same physical address range, are re-ferred to as a stripe. There are 9 s-tripes shown in Fig. 2. Each protected stripe has an assoc;ated parity block which contains the Exclusive OR of the other blocks in the stripe. In the first stripe, block 70 contains the parity for the remainlng blocks 72, 74, 76 and 78. A block 80 on storage device 32 contains the parity for the remaining blocks on the second stripe. Block 82, on storage device 3~ contains the parity for the thlrd stripe. ~locks 8~ and 86 contain the parity for the fourth ancl fifth stripes respectively. The parity blocks, including blocks 88, 90, and 92 for stripes 6, 7 and 8 are spread out, or distributed over the storage devices. The 9th stripe is an unprotected area which does not have a parity block associated with it. It is used to store data which does not need protection from loss.
Spreading of the parity information ensures that one particular storage device is not accessed much more -than the other storage devices during writing of -the parity records following updates to data records on the different stripes. A change to records stored in a block w-ill result in a change also having to be made to the parity block f~r the stripe including the changed records. Since the parity blocks for the stripes are spreacl over more than one storage device, the parity updates will not be concentrated at one devlce. Thus, I/O
activity is spread more evenly over al the storage devices.
~' ~' In Fig. 3, a unit table 310 contains information for each storage uni-t which participates in the parity protection. A physlcal address comprising a unit number and a sector, or page number is used to ident;-Fy the location oF the clesired da-ta. The unit table is then used to identify parity contrnl blocks indicated at 31~, 316, and 318.
Units 1-8 are members of the first parity set associated with control block 314. Units ~-13 are members of the second parity set associated with control block 316, and units N-2 to N are members of the Ith parity set associated with con-trol block 318.
Each entry in the unit table points to the control block associated with the set of storage devices of which the entry is a member. The control blocks identify which unit of the set contains the parity block for each stripe. In control block 31~, the stripe comprising the first 16 megabytes of storage has its parity inForma-tion stored in urlit number 1 in unit number 1's First 16 megabytes. The seconcl 16 megabytes of -the stripe comprlsing unlts 1-8 is contained ln the second 16 megabytes oF unit number 2. The parity block allocation continues in a round robin manner with units 3-8 having parity for the next 6 stripes respectively. The ninth stripe in the first pari-ty set has its parity stored in the ninth block of 16 megabytes on unit number 1. The last stripe, allocated to unit J, one of the eight units in the set, may not contain a full 16 megabytes depending on whether the addressable storage of the units is divisible by 16 megabytes A header in each of the parity control blocks describes which units arP in the set, and also identifies an address range common for each of the units which is not pro-tected by a parity group. Having a common range for each unit which is not protected, simplifies the parity pro-tection scheme. Since the same physical addresses on each storage device are exclusive ORed to determlne the parity information, no special tables are required to correlate the information to the parity okher -than those shown in Fig. 3. The common unpro-tected address range requires no special ~ .
33~
R09-86-Ol4 consideration, since the identifica-tion of the range is in the control block and is common for each unit.
Parity control block 31~ corresponds to the set of storage units 9-13 in Fig. 3. These five units may be thought of as s-torage units 30 - 38 in Fig. 2. The allocation of parity groups to storage un-its is on a round robin basis. Each consecutive 16 megabytes of storage has i-ts parity yroup stored on consecutive storage devices starting with device 30, or unit number 9 in Fig. 3. Unit number 9 also con-tains the parity group for the 80-96 megabyte range of the stripe for storage units 9-13 (30-38). The last stripe in -the set has its parity stored on the Kth unit, where K is the unit where the allocation of parity blocks ends because there are no more protected stripes.
Control block 318 corresponds to the set oF storage units N-2 through N in the Ith set oF storage units. The last unit allocatecl a parity block ls labeled L, and is one of the three units In the set depending on the number of stripes in the set. The Ith set contains the minimum number of storage units, three, considered desirable for implementation of the parity protection. The use of two units would be posslble, but would be similar to mirroring, with the extra step of an exclusive OR. Eight units in a set has been selected as the maximum in the preferred embodiment due to a system specific constraint to be discussed below. More than eight units may be used irl a set without loss of protection.
With a very large number oP units in a set, reconstruction of the data lost when a single unit fails would take a longer time because each unit would have to be read. There is also an increased chance of loss of more than one unit at a time. If this occurs, it ;s not possible to reconstruct the data from either of the lost unlts using the simple parity discusse-l above. The invention is consldered broad enough to cover a more complex data protection code, which may be stored similarly to the parity, and permit multiple bit correction in the event more than one storage ' .
. .
-,: .
7~33~
R09-86-Ol4 9 device fails. A set could also be arranged multidimensionally as described in the IBM TDB, Vol 24, No. 2 Pages 986-987, to permit reconstruction of clata from at leas-t two failed units. Further embodiments spread the parity information based on frequency of updat;ng data to spread the I/O activi-ty evenly, as opposed to spreading the parity itself evenly.
Each data record contains a version number. Since updates to multiple data records, covered by one parity record may occur, each record in a parity block also contains a corresponding version indication for each record in the slice it covers. A slice is a set of data records and their corresponding parity record. The version indications are not coverecl by the parity protection scheme. In Fig.
~, four data records, 410, ~12, 41~ and ~16 each contain a header wi-th a record version number indicated at ~18, ~20, 422 and 424 t respectively. A parity record 426 contains a header with four version numbers, 42~, ~30, 432 and 434 corresponding to the version numbers in the data records. The version numbers or indications may be any length compatible with the nurnber of blts available in the record headers. A one bit length was chosen due to the unavallabllity of further bits. Each time a data record is changed, its version number is incremented. The correspondlng version number in the parity record is also incrernented so that they have the same value.
The version numbers are used to check for synchronization of the parity record with each data record in a slice in the event of lost data. When changes to several data records occur, the data records may be written -to disk storage before or after the parity records are updated with the change masks. The change masks are queued for incorporation into the parity records. Assoclating a version number with each parity and data record adds a constraint that parity record updates for a given data record -to disk storage must be processed in the same order as the da-ta record updates. Otherwise, the vers~on numbers may not be accurate. A FIFO queue holds the change masks so that they are incorporated Into the Ro9-86-014 10 parity records on mass storage in the order that the change masks were generated.
Spec-ial consideration is given to the version numbers due to the limited availability of bits -in the héaders. The version number for each data record is stored in -the first bit position of the 6th byte of the respective records. The corresponding version numbers in the parity record are contained in the first 4 bit positions oF the 6th byte and tha first 3 bit positions of the gth byte o~ the header. The bit positions for the parity record header version numbers will be referred to as bits 1-7 corresponding to the order described above.
The units in a set are numbered 1-n corresponding to their order in the parity control block for the set. The version numbers are stored in ascending order, based on the unit number, in the parity record : headers, skipping the parity unit. If the third unit is the parity unit, the version numbers corresponding to the first two un;its are stored in bit positions 1 and 2. The version number corresponding to the fourth unit is stored in bit position 3. The nth unit version number is in position n-1 in the parity record header. Storing version numbers ;n this manner permits the largest possible set given the storage limitation. With extra storage, the size of the set may be optimized as a function of other system considerations. The positioning of the version numbers in the parity header could also have a straight forward unit number to bit position correspondence.
Because each data record has a version number that is independent of every other data record, no serialization is required for updates to different data records covered by the same parity record. Transfer of records into and out of main storage may be based on other system throughput considerations to improve processing speed. The separate version numbers also eliminate the need to read the parity records from mass storage before scheduling a data record write operation and queueing a parity update request.
In order to limit the amount of storage occupied by a version number, it is allowed to wrap from the largest to the smallest value without error. This enables the version number to be reused. If a one-bit version number is used, it wraps from 1 to 0. Thus, there are two values it assumes. A version number with a higher number of bits allows more values. Since data record and parity record updates are asynchronous, an update to a data record is held if the updated version number could be confused with an existing version number.
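The wraparound behaves like a modular increment; a minimal sketch, where `bits` is the assumed width of the version counter (1 in the preferred embodiment).

```c
/* A k-bit version counter increments modulo 2^k, wrapping from its
 * largest value back to zero without error; with k = 1 it simply
 * alternates between 0 and 1. */
static unsigned next_version(unsigned v, unsigned bits)
{
    return (v + 1u) & ((1u << bits) - 1u);
}
```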
Such confusion could exist if the same version number as the version number associated with the update could exist on the data record or the parity record on mass storage. The time frame that must be considered is from the time a new parity update request is placed on the FIFO queue to the time the request is completed. Both the data record and the parity record must be updated before an update request is considered complete.
The version numbers that could exist on disk before a new update request is completed include all the values for prior update requests that are still on the queue, plus the value that precedes the first (oldest) request element in the queue. If an update request is not removed from the queue until it is completed, it is only necessary for a new update to wait if there are other requests on the queue for the same data record, and the incremented version number for the new data record matches the version number preceding the first request element still in the queue.
The performance cost of searching the queue before scheduling an update is not severe as long as the number of update requests in the queue remains small. A reasonably fast access time to the storage used for the queue also reduces the performance cost. Main storage provides a suitable storage area for the queue. Keeping the number of update requests in the queue small is also important to ensure that no request has an excessive wait time.
When, as in the preferred embodiment, the version numbers are implemented as single-bit flags, it is only necessary to search the queue for any prior update request to the same data record in order to determine whether a new update must wait. There is no need to check version number values, because a new request must always wait if there is an incomplete request in the queue for the same data block.
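A sketch of that simplified check, assuming a hypothetical singly linked FIFO whose requests are removed only on completion; names and structure layout are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

/* One queued parity update request; the change mask and completion flags
 * are elided. Requests stay on the queue until both the data write and
 * the parity write complete. */
struct update_request {
    unsigned long record_addr;
    struct update_request *next;
};

/* With one-bit versions, a new update must wait whenever any incomplete
 * request for the same data record is still queued. */
static bool must_wait(const struct update_request *head, unsigned long addr)
{
    for (const struct update_request *r = head; r != NULL; r = r->next)
        if (r->record_addr == addr)
            return true;
    return false;
}
```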
In the preferred embodiment, the record headers contain seven unused bits available for implementation of the present invention. This places a limit on the number of units which may participate in a set. Only seven version numbers may be kept in a parity record, so at most eight units in total may participate in a parity protection set. As mentioned before, many more units, or as few as three units, could efficiently make use of the present invention.
CONFIGURATION OF SYSTEM FOR PARITY PROTECTION
Configuration of the system for parity protection of data is initiated by a user at 510 in the flow diagram of Fig. 5. A set-up task on processor 12 builds the parity control blocks 314-318 in block 512 of the flow diagram. The task uses information in a configuration unit table 312 which identifies the storage units coupled to the system. A storage device, such as the IBM 3370, may have more than one independent unit in it, such as an independently accessible arm, but only one unit from a storage device is chosen for any particular set. Another criterion for unit selection involves using as many disk controllers as possible in a set. These criteria are used so that a failure mode will not affect two units in a set. The set-up task also maximizes usable capacity by maximizing the number of units in a set.
After the control blocks have been built, the stripe arrays in the control blocks are built at 514. The stripe arrays, as previously mentioned, assign parity blocks on a round-robin basis to member units in each set. The stripe arrays are formed to indicate which units contain the parity blocks for successive stripes comprising 16-megabyte blocks from each unit. A user may also define a size of unprotected storage desired. A definition of the address range of the unprotected stripe is entered in the header of the control block. The set-up task will not define a block from the unprotected stripe as a parity block, so the entire unprotected stripe will be available for data records.
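The round-robin placement can be illustrated with a short sketch; note the patent records these assignments in the stripe arrays rather than computing them on demand, and the unit numbering and modulus below are assumptions.

```c
#define STRIPE_BYTES (16UL * 1024 * 1024)   /* 16-megabyte blocks per unit */

/* Illustrative round-robin assignment: the parity block of successive
 * stripes rotates through the member units (numbered 1..n here). An
 * unprotected stripe would simply have no parity entry. */
static int parity_unit_for_stripe(unsigned long stripe, int units_in_set)
{
    return (int)(stripe % (unsigned long)units_in_set) + 1;
}

/* Locate the stripe covering a given byte address on a unit. */
static unsigned long stripe_of_address(unsigned long byte_addr)
{
    return byte_addr / STRIPE_BYTES;
}
```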
Next, the set-up task writes the control blocks at 516 to more than one member unit of each set to which the control blocks correspond. The control blocks are used during recovery to identify the members of the sets. Thus, they must be available without parity recovery. They are written to more than one member unit of each set in case one of the member units containing them fails. In the event that the protection scheme protects against failure of more than one unit, the control blocks are written to at least one more unit than the number of units which may be recovered. In the preferred embodiment, they are written to each member of the set so that the units need not be searched to determine which unit contains the control block. Now that the sets are identified and the control blocks built, block 518 of the set-up task validates the parity blocks, including the version numbers. In the preferred embodiment, this is done by zeroing all the data on all the units. Since even parity is used for the parity protection, the result is a valid parity for all the data. The version numbers also start at zero.
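Zeroing the units yields valid parity because, with even parity, the parity block equals the XOR of its data blocks, and all-zero blocks satisfy this trivially. A hypothetical consistency check illustrating the invariant:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Even parity holds when the XOR of the data blocks in a stripe equals
 * the parity block; zero-filled units pass immediately (0 ^ 0 ... == 0).
 * Names are illustrative, not from the patent. */
static bool stripe_parity_valid(const uint8_t *const data[], int ndata,
                                const uint8_t *parity, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t x = 0;
        for (int d = 0; d < ndata; d++)
            x ^= data[d][i];
        if (x != parity[i])
            return false;
    }
    return true;
}
```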
The system is then initialized in a standard manner at 520 by causing an initial program load.
It is also possible to add a member to a set which does not yet have the maximum number of units. The new member is preferably zeroed and added to the unit table. The parity blocks of the set are then redistributed to include the new unit. The control block for the set is then revised, and the units which contained parity blocks that were transferred to the new unit have the address ranges that contained those parity blocks zeroed, validating the parity group for that stripe.
In the preferred embodiment, the control block is temporarily revised to indicate that a change in the set will occur due to the addition of the new unit. The temporary change is made in case a failure occurs during redistribution of the parity blocks. The parity blocks of the existing set are then redistributed to include the new unit.
The units which contained parity blocks that were transferred to the new unit have the address ranges that contained those parity blocks zeroed. When redistribution is complete, the changes to the control block are made permanent.
It is also possible to add a new unit without transferring parity blocks. Not transferring the parity blocks, however, would forgo the increase in the access rate of the set made possible by adding the new unit. The new unit would nevertheless be protected by the existing parity blocks.
UPDATING RECORD AND PARITY IN A SLICE
In Fig. 6, changing a data record and its corresponding parity record is shown. Such a change may be called for by a person changing data in a data base, or by a machine process requesting a change for any of a number of reasons. At 610, a user task operating on processor 12 reads the data record to be changed. The user task first makes an extra copy of the data record at 612 and then makes the changes to the data record in a conventional manner.
The user task then creates a change mask at 614 by exclusive ORing the changed data record into the extra copy of the data record.
The first bit of the changed data is exclusive ORed with the first bit of the extra copy to form the first bit of the change mask. Each consecutive bit of the changed data is similarly exclusive ORed with corresponding bits of the extra copy to form further bits of the change mask. An already existing machine instruction performs the exclusive OR of the headers of the changed record and the copied record. Two further exclusive OR machine instructions perform exclusive ORs of the pages in 256-byte blocks.
The positions in the header corresponding to the version numbers in the parity record are then zeroed in the change mask so that they will not affect the version numbers of the parity record.
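A sketch of the change-mask construction and of zeroing the version-bit positions; a plain byte loop stands in for the wide exclusive-OR machine instructions, and the bit masks assume 1-based bit numbering with the "first bit" taken as the most significant bit of a byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a change mask by XORing the changed record into the saved copy.
 * The patent uses wide exclusive-OR machine instructions over the header
 * and 256-byte page blocks; a byte loop has the same effect. */
static void make_change_mask(const uint8_t *old_copy, const uint8_t *changed,
                             uint8_t *mask, size_t len)
{
    for (size_t i = 0; i < len; i++)
        mask[i] = old_copy[i] ^ changed[i];
}

/* Zero the mask bits that overlay the parity record's version numbers so
 * the later XOR cannot disturb them: bits 1-4 of the 6th byte and bits
 * 1-3 of the 9th byte, under the MSB-first assumption stated above. */
static void clear_version_bits(uint8_t *mask)
{
    mask[5] &= (uint8_t)~0xF0u;  /* bits 1-4 of byte 6 */
    mask[8] &= (uint8_t)~0xE0u;  /* bits 1-3 of byte 9 */
}
```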
The version number in the data record header is then incremented by the user task at block 616. A user task at 617 then determines which unit contains the appropriate parity record by searching the unit table 312 based on the unit number from the physical address of the new data. The unit table indicates which control block 314-318 to use to identify the unit containing the parity record for the particular data record. Again, the address of the new data record is used to identify the stripe of interest. Once the unit containing the parity record for the stripe is identified, task 617 places an update request, which includes the record address and change mask, on a queue 624 for the appropriate unit. The update request also indicates that the data record has not yet been written. Prior to placing an update request on queue 624, task 617 searches queue 624 to ensure there can be no confusion with version numbers, as discussed previously. If confusion is possible, the user task waits until confusion is not possible before proceeding.
The address of the parity record is known to be the same as the address of the new data, except that it is on a different unit. A write request for the new data is issued at 618 to a queue 620. When the data record is written to storage 622, an indication of that fact is sent at 625 to the update request on queue 624. Flow then returns to wait for the next data record change.
A parity update task starts at 627 by getting the next parity update request from queue 624. Once the parity record identified in the update request has been read at 626, the change mask is exclusive ORed at 628 into the parity record. The version numbers are not changed by the exclusive OR because the change mask contains zeros in the bit positions corresponding to the version numbers. The exclusive OR is performed using the same exclusive OR machine instruction as is used by block 614 of the user task. The parity record version number corresponding to the data record is then incremented at 630, and a write request for the parity record is issued at 632. The parity update task then gets the next parity update request at 627.
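The core of the parity update task might look like the following sketch, which folds the mask into the parity record and toggles the matching one-bit version counter; the byte and bit offsets rest on the same assumptions as the earlier sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a queued change mask into the parity record; the version bits in
 * the mask are already zero, so they pass through unchanged. */
static void apply_change_mask(uint8_t *parity_rec, const uint8_t *mask,
                              size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity_rec[i] ^= mask[i];
}

/* Toggle the one-bit parity version counter for bit positions 1-7:
 * bits 1-4 live in the 6th header byte, bits 5-7 in the 9th
 * (same MSB-first assumption as before). */
static void toggle_parity_version(uint8_t *parity_rec, int bit)
{
    if (bit <= 4)
        parity_rec[5] ^= (uint8_t)(0x80u >> (bit - 1));
    else
        parity_rec[8] ^= (uint8_t)(0x80u >> (bit - 5));
}
```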
A queue 634 stores parity record write requests for the storage units. In this case, a storage unit 636 is indicated for the particular write request. Storage unit 636 is different from unit 622 because the data and parity records are not to be written to the same unit. When the write of the parity record completes, the update request on queue 624 is notified at 637. When both the data write and the parity write are completed, the entry is removed from queue 624, indicating that there is no longer the possibility of confusion with the version number.
RECORD RECOVERY
Recovery is performed either when a single record read error is encountered during normal operation of the system or when an entire unit fails.
When a unit fails, the data lost on the unit is reconstructed from the remaining units of the set, as indicated in the control blocks stored on more than one member of the set. The failed unit is replaced or repaired. Data for the new unit is then reconstructed from the remaining members of the set, record by record. A parity record is reconstructed simply by reading the data records in the set and regenerating the parity. The version numbers in the parity record are set equal to the corresponding version numbers in the data records.
When regenerating a data record, a check is first made on each of the data records to determine if their version numbers match the version numbers in the parity record for that slice. If any of the version numbers in the slice do not match, a lost data indication is written in the header of the lost record. If the version numbers match, the records in the slice are then exclusive ORed one by one into the new record. The appropriate version number is then copied from the parity record into the new record header.
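A sketch of the record regeneration just described, with illustrative names; the version check gates the XOR reconstruction, and header fix-up and I/O are elided.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild a lost record by XORing the surviving records of its slice
 * (the remaining data records plus the parity record), after checking
 * that each surviving data record's version matches the parity record. */
static bool rebuild_record(uint8_t *out, size_t len,
                           const uint8_t *const survivors[], int nsurvivors,
                           const uint8_t data_versions[],
                           const uint8_t parity_versions[], int ndata)
{
    for (int r = 0; r < ndata; r++)
        if (data_versions[r] != parity_versions[r])
            return false;          /* slice out of sync: mark data lost */

    for (size_t i = 0; i < len; i++)
        out[i] = 0;
    for (int r = 0; r < nsurvivors; r++)
        for (size_t i = 0; i < len; i++)
            out[i] ^= survivors[r][i];
    return true;
}
```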
If a read error is encountered on either a data record or a parity record, the contents of the failing record are reconstructed with the same mechanism as described above for recovery of an entire unit. In this case, it is necessary to hold all change activity on the slice containing the failed record while recovery is in progress.
Following reconstruction, normal operation of the system continues, with the only data lost being that for which an update of data or parity was made and a unit failed before the corresponding parity or data record could be written. Thus, the vast majority of data on the failed unit is recovered without the unit redundancy overhead of mirroring. By distributing the parity information over the members of the set, as opposed to tying up one device with the parity information, parallel operation of the storage devices is utilized to provide maximum access rates.
While the invention has been described with respect to one or more preferred embodiments, and with respect to one particular system, it will be recognized by those skilled in the art that the invention can take many forms and shapes. The record sizes are in no way limited to those discussed, nor are the storage units limited to disk drive devices. The fact that only identical devices are used in sets is merely a matter of design choice for simplification. Numerous combinations of storage units in sets, and distributions of the data protection or parity blocks, are within the scope of the invention as described and as claimed below.
Claims (22)
1. A data protection mechanism for a computer system having multiple independently accessible storage devices which store blocks of data, the data protection mechanism comprising:
generator means for generating parity blocks as a function of sets of data blocks, said data blocks in a set corresponding to one parity block being stored on different storage devices;
storage management means for managing the storage of data blocks and parity blocks onto the storage devices; and spreading means coupled to the storage management means for identifying a storage device to the storage management means on which each parity block is to be stored such that no one storage device contains the parity blocks for all of the groups of data blocks.
2. The data protection mechanism of claim 1 wherein the spreading means substantially uniformly distributes the parity blocks to storage devices.
3. The data protection mechanism of claim 2 wherein the spreading means distributes the parity blocks in a round robin manner.
4. The data protection mechanism of claim 1 wherein each data block in a set, and its corresponding parity block form a stripe of the same address ranges of each of the storage devices.
5. The data protection mechanism of claim 4 wherein at least one stripe of same addresses comprises data blocks without a parity block.
6. The data protection mechanism of claim 4 wherein there are at least as many stripes as there are blocks of data in a group plus one for the parity block.
7. The data protection mechanism of claim 4 wherein the size of the address range is at least as large as the largest system allocatable contiguous piece of storage.
8. The data protection mechanism of claim 1 wherein each storage device has the same address size.
9. The data protection mechanism of claim 1 wherein a set comprises at least three storage devices.
10. The data protection mechanism of claim 1 wherein the storage devices in a set are selected to minimize the number of storage devices affected by a failure of a single component of the computer system.
11. The data protection mechanism of claim 1 wherein a block comprises a plurality of predetermined sized records, the mechanism further comprising version generator means for providing independent version numbers to data records having the same address in a set and corresponding version numbers to the parity record covering such set of data records.
12. The data protection mechanism of claim 11 wherein the version numbers comprise counters, and the version generator means increments the counter in a revised data record and increments the corresponding counter in the parity record.
13. The data protection mechanism of claim 12 and further comprising data recovery means for recovering records lost on a failed storage device by combining the remaining records in the set of storage devices.
14. The data protection mechanism of claim 13 wherein the recovery of a lost record from a device is contingent on each remaining record in the one set of records having version numbers matching the version numbers in the parity record.
15. The data protection mechanism of claim 1 and further comprising change mask means for producing a change mask for a parity record when a data record in the set is to be changed, said change mask being generated as a function of the original data record and the changed data record.
16. A method of protecting data stored on a plurality of memory devices comprising:
dividing addressable memory on each of said memory devices into blocks of memory such that each memory has the same number of blocks, the blocks on each memory having the same address range comprising a stripe;
storing parity information for each stripe of memory blocks in a distributed manner across the memory devices to enhance the overall access rate of the memory devices; and changing the parity information for each stripe as a function of a change to a block in its corresponding stripe without reading all the blocks in the stripe.
17. The method of claim 16 wherein the step of changing the parity information comprises the steps of:
reading the data to be changed;
making a copy of the data to be changed;
making the changes to the data;
generating a change mask as a function of the copy of the data to be changed and the changed data;
writing the changed data to memory;
reading the corresponding parity data;
applying the change mask to the parity data to update it;
and writing the updated parity data to memory.
18. The method of claim 16 wherein the step of changing the parity information further comprises the step of generating a version number which is stored with both the block of data and the parity block.
19. A data protection mechanism for a computer system having multiple independently accessible storage devices which store records of data, the data protection mechanism comprising:
generator means for generating parity records as a function of sets of data records, said data records in a set corresponding to one parity record, each of said data and parity records being stored on different storage devices;
storage management means for managing the storage of data records and parity records onto the storage devices; and version generator means for providing independent version numbers to data records in a set and corresponding version numbers to the parity record covering such set of data records.
20. The data protection mechanism of claim 19 wherein the version numbers comprise a 1 bit counter in the data records and a number of one bit counters in the parity record equal to the number of data records in the set.
21. The data protection mechanism of claim 19 wherein reconstruction of a record in a set is conditional upon the version numbers in the remaining records being consistent with the corresponding version numbers in the parity record.
22. The data protection mechanism of claim 19 wherein the version numbers comprise a multibit counter in the data records and a number of multibit counters in the parity record equal to the number of data records in the set, wherein said counters count in a predetermined sequence of values.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US873,249 | 1986-06-12 | ||
US06873249 US4761785B1 (en) | 1986-06-12 | 1986-06-12 | Parity spreading to enhance storage access |
Publications (1)
Publication Number | Publication Date |
---|---|
CA1270333A true CA1270333A (en) | 1990-06-12 |
Family
ID=25361257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000535598A Expired - Lifetime CA1270333A (en) | 1986-06-12 | 1987-04-27 | Parity spreading to enhance storge access |
Country Status (5)
Country | Link |
---|---|
US (1) | US4761785B1 (en) |
EP (1) | EP0249091B1 (en) |
JP (1) | JPS62293355A (en) |
CA (1) | CA1270333A (en) |
DE (1) | DE3750790T2 (en) |
Families Citing this family (314)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4949326A (en) * | 1986-12-10 | 1990-08-14 | Matsushita Electric Industrial Co., Ltd. | Optical information recording and reproducing system using optical disks having an error correction function |
USRE34100E (en) * | 1987-01-12 | 1992-10-13 | Seagate Technology, Inc. | Data error correction system |
US5257367A (en) * | 1987-06-02 | 1993-10-26 | Cab-Tek, Inc. | Data storage system with asynchronous host operating system communication link |
US4942579A (en) * | 1987-06-02 | 1990-07-17 | Cab-Tek, Inc. | High-speed, high-capacity, fault-tolerant error-correcting storage system |
US4870643A (en) * | 1987-11-06 | 1989-09-26 | Micropolis Corporation | Parallel drive array storage system |
US4993030A (en) * | 1988-04-22 | 1991-02-12 | Amdahl Corporation | File system for a plurality of storage classes |
US4914656A (en) * | 1988-06-28 | 1990-04-03 | Storage Technology Corporation | Disk drive memory |
US4989206A (en) * | 1988-06-28 | 1991-01-29 | Storage Technology Corporation | Disk drive memory |
US5077736A (en) * | 1988-06-28 | 1991-12-31 | Storage Technology Corporation | Disk drive memory |
US5283791A (en) * | 1988-08-02 | 1994-02-01 | Cray Research Systems, Inc. | Error recovery method and apparatus for high performance disk drives |
US5218689A (en) * | 1988-08-16 | 1993-06-08 | Cray Research, Inc. | Single disk emulation interface for an array of asynchronously operating disk drives |
JP2718708B2 (en) * | 1988-08-26 | 1998-02-25 | 株式会社日立製作所 | Control method for storage control system, storage control system, and storage control device |
US5148432A (en) * | 1988-11-14 | 1992-09-15 | Array Technology Corporation | Arrayed disk drive system and method |
US5007053A (en) * | 1988-11-30 | 1991-04-09 | International Business Machines Corporation | Method and apparatus for checksum address generation in a fail-safe modular memory |
US5008886A (en) * | 1989-01-27 | 1991-04-16 | Digital Equipment Corporation | Read-modify-write operation |
US5185746A (en) * | 1989-04-14 | 1993-02-09 | Mitsubishi Denki Kabushiki Kaisha | Optical recording system with error correction and data recording distributed across multiple disk drives |
US5146574A (en) * | 1989-06-27 | 1992-09-08 | Sf2 Corporation | Method and circuit for programmable selecting a variable sequence of element using write-back |
US5072378A (en) * | 1989-12-18 | 1991-12-10 | Storage Technology Corporation | Direct access storage device with independently stored parity |
US5402428A (en) * | 1989-12-25 | 1995-03-28 | Hitachi, Ltd. | Array disk subsystem |
JPH03216751A (en) * | 1990-01-05 | 1991-09-24 | Internatl Business Mach Corp <Ibm> | Method of transferring file |
US5315708A (en) * | 1990-02-28 | 1994-05-24 | Micro Technology, Inc. | Method and apparatus for transferring data through a staging memory |
US5140592A (en) * | 1990-03-02 | 1992-08-18 | Sf2 Corporation | Disk array system |
US5195100A (en) * | 1990-03-02 | 1993-03-16 | Micro Technology, Inc. | Non-volatile memory storage of write operation identifier in data sotrage device |
US5134619A (en) * | 1990-04-06 | 1992-07-28 | Sf2 Corporation | Failure-tolerant mass storage system |
US5212785A (en) * | 1990-04-06 | 1993-05-18 | Micro Technology, Inc. | Apparatus and method for controlling data flow between a computer and memory devices |
US5233618A (en) * | 1990-03-02 | 1993-08-03 | Micro Technology, Inc. | Data correcting applicable to redundant arrays of independent disks |
US5388243A (en) * | 1990-03-09 | 1995-02-07 | Mti Technology Corporation | Multi-sort mass storage device announcing its active paths without deactivating its ports in a network architecture |
US5129082A (en) * | 1990-03-27 | 1992-07-07 | Sun Microsystems, Inc. | Method and apparatus for searching database component files to retrieve information from modified files |
US5325497A (en) * | 1990-03-29 | 1994-06-28 | Micro Technology, Inc. | Method and apparatus for assigning signatures to identify members of a set of mass of storage devices |
US5202856A (en) * | 1990-04-05 | 1993-04-13 | Micro Technology, Inc. | Method and apparatus for simultaneous, interleaved access of multiple memories by multiple ports |
US5233692A (en) * | 1990-04-06 | 1993-08-03 | Micro Technology, Inc. | Enhanced interface permitting multiple-byte parallel transfers of control information and data on a small computer system interface (SCSI) communication bus and a mass storage system incorporating the enhanced interface |
US5956524A (en) * | 1990-04-06 | 1999-09-21 | Micro Technology Inc. | System and method for dynamic alignment of associated portions of a code word from a plurality of asynchronous sources |
US5414818A (en) * | 1990-04-06 | 1995-05-09 | Mti Technology Corporation | Method and apparatus for controlling reselection of a bus by overriding a prioritization protocol |
US5214778A (en) * | 1990-04-06 | 1993-05-25 | Micro Technology, Inc. | Resource management in a multiple resource system |
US5130992A (en) * | 1990-04-16 | 1992-07-14 | International Business Machines Corporaiton | File-based redundant parity protection in a parallel computing system |
US5263145A (en) * | 1990-05-24 | 1993-11-16 | International Business Machines Corporation | Method and means for accessing DASD arrays with tuned data transfer rate and concurrency |
JPH0731582B2 (en) * | 1990-06-21 | 1995-04-10 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Method and apparatus for recovering parity protected data |
US5220569A (en) * | 1990-07-09 | 1993-06-15 | Seagate Technology, Inc. | Disk array with error type indication and selection of error correction method |
US5265098A (en) * | 1990-08-03 | 1993-11-23 | International Business Machines Corporation | Method and means for managing DASD array accesses when operating in degraded mode |
US5375128A (en) * | 1990-10-18 | 1994-12-20 | Ibm Corporation (International Business Machines Corporation) | Fast updating of DASD arrays using selective shadow writing of parity and data blocks, tracks, or cylinders |
EP0481735A3 (en) * | 1990-10-19 | 1993-01-13 | Array Technology Corporation | Address protection circuit |
US5208813A (en) * | 1990-10-23 | 1993-05-04 | Array Technology Corporation | On-line reconstruction of a failed redundant array system |
AU8683991A (en) * | 1990-11-09 | 1992-05-14 | Array Technology Corporation | Logical partitioning of a redundant array storage system |
US5235601A (en) * | 1990-12-21 | 1993-08-10 | Array Technology Corporation | On-line restoration of redundancy information in a redundant array system |
US5274799A (en) * | 1991-01-04 | 1993-12-28 | Array Technology Corporation | Storage device array architecture with copyback cache |
US6874101B2 (en) * | 1991-01-31 | 2005-03-29 | Hitachi, Ltd. | Storage unit subsystem |
JP3409859B2 (en) * | 1991-01-31 | 2003-05-26 | 株式会社日立製作所 | Control method of control device |
US5239640A (en) * | 1991-02-01 | 1993-08-24 | International Business Machines Corporation | Data storage system and method including data and checksum write staging storage |
US5303244A (en) * | 1991-03-01 | 1994-04-12 | Teradata | Fault tolerant disk drive matrix |
US5257362A (en) * | 1991-03-08 | 1993-10-26 | International Business Machines Corporation | Method and means for ensuring single pass small read/write access to variable length records stored on selected DASDs in a DASD array |
US5345565A (en) * | 1991-03-13 | 1994-09-06 | Ncr Corporation | Multiple configuration data path architecture for a disk array controller |
US5506979A (en) * | 1991-04-02 | 1996-04-09 | International Business Machines Corporation | Method and means for execution of commands accessing variable length records stored on fixed block formatted DASDS of an N+2 DASD synchronous array |
JP2743606B2 (en) * | 1991-04-11 | 1998-04-22 | 三菱電機株式会社 | Array type recording device |
JP3187525B2 (en) * | 1991-05-17 | 2001-07-11 | ヒュンダイ エレクトロニクス アメリカ | Bus connection device |
US5278838A (en) * | 1991-06-18 | 1994-01-11 | Ibm Corp. | Recovery from errors in a redundant array of disk drives |
US5333143A (en) * | 1991-08-29 | 1994-07-26 | International Business Machines Corporation | Method and means for b-adjacent coding and rebuilding data from up to two unavailable DASDS in a DASD array |
US5636358A (en) * | 1991-09-27 | 1997-06-03 | Emc Corporation | Method and apparatus for transferring data in a storage device including a dual-port buffer |
US5499337A (en) | 1991-09-27 | 1996-03-12 | Emc Corporation | Storage device array architecture with solid-state redundancy unit |
US5237658A (en) * | 1991-10-01 | 1993-08-17 | Tandem Computers Incorporated | Linear and orthogonal expansion of array storage in multiprocessor computing systems |
US5379417A (en) * | 1991-11-25 | 1995-01-03 | Tandem Computers Incorporated | System and method for ensuring write data integrity in a redundant array data storage system |
CA2126754A1 (en) * | 1991-12-27 | 1993-07-08 | E. David Neufeld | Method for performing disk array operations using a nonuniform stripe size mapping scheme |
US5333305A (en) * | 1991-12-27 | 1994-07-26 | Compaq Computer Corporation | Method for improving partial stripe write performance in disk array subsystems |
EP0551009B1 (en) * | 1992-01-08 | 2001-06-13 | Emc Corporation | Method for synchronizing reserved areas in a redundant storage array |
US5341381A (en) * | 1992-01-21 | 1994-08-23 | Tandem Computers, Incorporated | Redundant array parity caching system |
US5371743A (en) * | 1992-03-06 | 1994-12-06 | Data General Corporation | On-line module replacement in a multiple module data processing system |
DE69320388T2 (en) * | 1992-03-06 | 1999-05-12 | Data General Corp | Data handling in a system with a processor to control access to a plurality of data storage disks |
US5305326A (en) * | 1992-03-06 | 1994-04-19 | Data General Corporation | High availability disk arrays |
AU653670B2 (en) * | 1992-03-10 | 1994-10-06 | Data General Corporation | Improvements for high availability disk arrays |
US5469566A (en) * | 1992-03-12 | 1995-11-21 | Emc Corporation | Flexible parity generation circuit for intermittently generating a parity for a plurality of data channels in a redundant array of storage units |
WO1993018456A1 (en) * | 1992-03-13 | 1993-09-16 | Emc Corporation | Multiple controller sharing in a redundant storage array |
US5740465A (en) * | 1992-04-08 | 1998-04-14 | Hitachi, Ltd. | Array disk controller for grouping host commands into a single virtual host command |
US5418921A (en) * | 1992-05-05 | 1995-05-23 | International Business Machines Corporation | Method and means for fast writing data to LRU cached based DASD arrays under diverse fault tolerant modes |
US5708668A (en) * | 1992-05-06 | 1998-01-13 | International Business Machines Corporation | Method and apparatus for operating an array of storage devices |
JP2888401B2 (en) * | 1992-08-03 | 1999-05-10 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Synchronization method for redundant disk drive arrays |
US6640235B1 (en) | 1992-08-20 | 2003-10-28 | Intel Corporation | Expandable mass disk drive storage system |
US5913926A (en) * | 1992-08-20 | 1999-06-22 | Farrington Investments Ltd. | Expandable modular data storage system having parity storage capability |
JP3183719B2 (en) * | 1992-08-26 | 2001-07-09 | 三菱電機株式会社 | Array type recording device |
JP3181398B2 (en) * | 1992-10-06 | 2001-07-03 | 三菱電機株式会社 | Array type recording device |
EP0600137A1 (en) * | 1992-11-30 | 1994-06-08 | International Business Machines Corporation | Method and apparatus for correcting errors in a memory |
US5579474A (en) * | 1992-12-28 | 1996-11-26 | Hitachi, Ltd. | Disk array system and its control method |
JP3176157B2 (en) * | 1992-12-28 | 2001-06-11 | 株式会社日立製作所 | Disk array device and data updating method thereof |
JP2743756B2 (en) * | 1993-02-03 | 1998-04-22 | 日本電気株式会社 | Semiconductor disk device |
JP3258117B2 (en) * | 1993-03-08 | 2002-02-18 | 株式会社日立製作所 | Storage subsystem |
US5649162A (en) * | 1993-05-24 | 1997-07-15 | Micron Electronics, Inc. | Local bus interface |
US5867640A (en) * | 1993-06-01 | 1999-02-02 | Mti Technology Corp. | Apparatus and method for improving write-throughput in a redundant array of mass storage devices |
US6138126A (en) * | 1995-05-31 | 2000-10-24 | Network Appliance, Inc. | Method for allocating files in a file system integrated with a raid disk sub-system |
ATE409907T1 (en) * | 1993-06-03 | 2008-10-15 | Network Appliance Inc | METHOD AND DEVICE FOR DESCRIBING ANY AREAS OF A FILE SYSTEM |
US7174352B2 (en) | 1993-06-03 | 2007-02-06 | Network Appliance, Inc. | File system image transfer |
US6604118B2 (en) | 1998-07-31 | 2003-08-05 | Network Appliance, Inc. | File system image transfer |
US5963962A (en) * | 1995-05-31 | 1999-10-05 | Network Appliance, Inc. | Write anywhere file-system layout |
DE69431186T2 (en) * | 1993-06-03 | 2003-05-08 | Network Appliance Inc | Method and file system for assigning file blocks to storage space in a RAID disk system |
JPH08511368A (en) | 1993-06-04 | 1996-11-26 | ネットワーク・アプリアンス・コーポレーション | Method for forming parity in RAID subsystem using non-volatile memory |
US5555389A (en) * | 1993-07-07 | 1996-09-10 | Hitachi, Ltd. | Storage controller for performing dump processing |
US5987622A (en) * | 1993-12-10 | 1999-11-16 | Tm Patents, Lp | Parallel computer system including parallel storage subsystem including facility for correction of data in the event of failure of a storage device in parallel storage subsystem |
US5396620A (en) * | 1993-12-21 | 1995-03-07 | Storage Technology Corporation | Method for writing specific values last into data storage groups containing redundancy |
US20030088611A1 (en) * | 1994-01-19 | 2003-05-08 | Mti Technology Corporation | Systems and methods for dynamic alignment of associated portions of a code word from a plurality of asynchronous sources |
US5911150A (en) * | 1994-01-25 | 1999-06-08 | Data General Corporation | Data storage tape back-up for data processing systems using a single driver interface unit |
US5446855A (en) * | 1994-02-07 | 1995-08-29 | Buslogic, Inc. | System and method for disk array data transfer |
US5537567A (en) * | 1994-03-14 | 1996-07-16 | International Business Machines Corporation | Parity block configuration in an array of storage devices |
JP2981711B2 (en) * | 1994-06-16 | 1999-11-22 | 日本アイ・ビー・エム株式会社 | Disk storage device |
US5467361A (en) * | 1994-06-20 | 1995-11-14 | International Business Machines Corporation | Method and system for separate data and media maintenance within direct access storage devices |
US5657439A (en) * | 1994-08-23 | 1997-08-12 | International Business Machines Corporation | Distributed subsystem sparing |
US5412668A (en) * | 1994-09-22 | 1995-05-02 | International Business Machines Corporation | Parity striping feature for optical disks |
US5623595A (en) * | 1994-09-26 | 1997-04-22 | Oracle Corporation | Method and apparatus for transparent, real time reconstruction of corrupted data in a redundant array data storage system |
GB2293912A (en) * | 1994-10-05 | 1996-04-10 | Ibm | Disk storage device for disk array |
US5497457A (en) * | 1994-10-17 | 1996-03-05 | International Business Machines Corporation | Redundant arrays of independent libraries of dismountable media with parity logging |
US5488701A (en) * | 1994-11-17 | 1996-01-30 | International Business Machines Corporation | In log sparing for log structured arrays |
US5574882A (en) * | 1995-03-03 | 1996-11-12 | International Business Machines Corporation | System and method for identifying inconsistent parity in an array of storage |
US5848230A (en) | 1995-05-25 | 1998-12-08 | Tandem Computers Incorporated | Continuously available computer memory systems |
KR100300836B1 (en) * | 1995-06-08 | 2001-09-03 | 포만 제프리 엘 | Data reconstruction method and data storage system |
US5875456A (en) * | 1995-08-17 | 1999-02-23 | Nstor Corporation | Storage device array and methods for striping and unstriping data and for adding and removing disks online to/from a raid storage array |
US5657468A (en) * | 1995-08-17 | 1997-08-12 | Ambex Technologies, Inc. | Method and apparatus for improving performance in a reduntant array of independent disks |
WO1997011426A1 (en) | 1995-09-18 | 1997-03-27 | Cyberstorage Systems, Inc. | Universal storage management system |
US5799200A (en) * | 1995-09-28 | 1998-08-25 | Emc Corporation | Power failure responsive apparatus and method having a shadow dram, a flash ROM, an auxiliary battery, and a controller |
US5941994A (en) * | 1995-12-22 | 1999-08-24 | Lsi Logic Corporation | Technique for sharing hot spare drives among multiple subsystems |
US5838892A (en) * | 1995-12-29 | 1998-11-17 | Emc Corporation | Method and apparatus for calculating an error detecting code block in a disk drive controller |
US6055577A (en) * | 1996-05-06 | 2000-04-25 | Oracle Corporation | System for granting bandwidth for real time processes and assigning bandwidth for non-real time processes while being forced to periodically re-arbitrate for new assigned bandwidth |
US5790774A (en) * | 1996-05-21 | 1998-08-04 | Storage Computer Corporation | Data storage system with dedicated allocation of parity storage and parity reads and writes only on operations requiring parity information |
US5856989A (en) * | 1996-08-13 | 1999-01-05 | Hewlett-Packard Company | Method and apparatus for parity block generation |
US6041423A (en) * | 1996-11-08 | 2000-03-21 | Oracle Corporation | Method and apparatus for using undo/redo logging to perform asynchronous updates of parity and data pages in a redundant array data storage environment |
US6161165A (en) * | 1996-11-14 | 2000-12-12 | Emc Corporation | High performance data path with XOR on the fly |
KR100223186B1 (en) * | 1997-01-29 | 1999-10-15 | 윤종용 | Data recording method in dvd-ram |
JPH10254642A (en) * | 1997-03-14 | 1998-09-25 | Hitachi Ltd | Storage device system |
JP4499193B2 (en) * | 1997-04-07 | 2010-07-07 | ソニー株式会社 | Recording / reproducing apparatus and recording / reproducing method |
US5974503A (en) * | 1997-04-25 | 1999-10-26 | Emc Corporation | Storage and access of continuous media files indexed as lists of raid stripe sets associated with file names |
US5968182A (en) * | 1997-05-12 | 1999-10-19 | International Business Machines Corporation | Method and means for utilizing device long busy response for resolving detected anomalies at the lowest level in a hierarchical, demand/response storage management subsystem |
US5991894A (en) * | 1997-06-06 | 1999-11-23 | The Chinese University Of Hong Kong | Progressive redundancy transmission |
US6016552A (en) * | 1997-06-06 | 2000-01-18 | The Chinese University Of Hong Kong | Object striping focusing on data object |
US6112277A (en) * | 1997-09-25 | 2000-08-29 | International Business Machines Corporation | Method and means for reducing device contention by random accessing and partial track staging of records according to a first DASD format but device mapped according to a second DASD format |
CN1281560A (en) | 1997-10-08 | 2001-01-24 | 西加特技术有限责任公司 | Hybrid data storage and reconstruction system and method for data storage device |
US6112255A (en) * | 1997-11-13 | 2000-08-29 | International Business Machines Corporation | Method and means for managing disk drive level logic and buffer modified access paths for enhanced raid array data rebuild and write update operations |
US6101624A (en) * | 1998-01-21 | 2000-08-08 | International Business Machines Corporation | Method and apparatus for detecting and correcting anomalies in field-programmable gate arrays using CRCs for anomaly detection and parity for anomaly correction |
US6457130B2 (en) | 1998-03-03 | 2002-09-24 | Network Appliance, Inc. | File access control in a multi-protocol file server |
US6317844B1 (en) | 1998-03-10 | 2001-11-13 | Network Appliance, Inc. | File server storage arrangement |
DE19811035A1 (en) * | 1998-03-13 | 1999-09-16 | Grau Software Gmbh | Data storage method for data sequences |
US6219751B1 (en) | 1998-04-28 | 2001-04-17 | International Business Machines Corporation | Device level coordination of access operations among multiple raid control units |
US6704837B2 (en) | 1998-06-29 | 2004-03-09 | International Business Machines Corporation | Method and apparatus for increasing RAID write performance by maintaining a full track write counter |
US6427212B1 (en) | 1998-11-13 | 2002-07-30 | Tricord Systems, Inc. | Data fault tolerance software apparatus and method |
US6343984B1 (en) | 1998-11-30 | 2002-02-05 | Network Appliance, Inc. | Laminar flow duct cooling system |
US6725392B1 (en) | 1999-03-03 | 2004-04-20 | Adaptec, Inc. | Controller fault recovery system for a distributed file system |
US6449731B1 (en) | 1999-03-03 | 2002-09-10 | Tricord Systems, Inc. | Self-healing computer system storage |
US6530036B1 (en) | 1999-08-17 | 2003-03-04 | Tricord Systems, Inc. | Self-healing computer system storage |
US6970450B1 (en) * | 1999-10-29 | 2005-11-29 | Array Telecom Corporation | System, method and computer program product for point-to-point bandwidth conservation in an IP network |
JP2001166887A (en) * | 1999-12-08 | 2001-06-22 | Sony Corp | Data recording and reproducing device and data recording and reproducing method |
US7509420B2 (en) | 2000-02-18 | 2009-03-24 | Emc Corporation | System and method for intelligent, globally distributed network storage |
US7194504B2 (en) * | 2000-02-18 | 2007-03-20 | Avamar Technologies, Inc. | System and method for representing and maintaining redundant data sets utilizing DNA transmission and transcription techniques |
US7062648B2 (en) * | 2000-02-18 | 2006-06-13 | Avamar Technologies, Inc. | System and method for redundant array network storage |
US6826711B2 (en) | 2000-02-18 | 2004-11-30 | Avamar Technologies, Inc. | System and method for data protection with multidimensional parity |
US6704730B2 (en) | 2000-02-18 | 2004-03-09 | Avamar Technologies, Inc. | Hash file system and method for use in a commonality factoring system |
US6820088B1 (en) * | 2000-04-10 | 2004-11-16 | Research In Motion Limited | System and method for synchronizing data records between multiple databases |
US6728922B1 (en) | 2000-08-18 | 2004-04-27 | Network Appliance, Inc. | Dynamic data space |
US7072916B1 (en) | 2000-08-18 | 2006-07-04 | Network Appliance, Inc. | Instant snapshot |
US6636879B1 (en) * | 2000-08-18 | 2003-10-21 | Network Appliance, Inc. | Space allocation in a write anywhere file system |
US6611852B1 (en) | 2000-09-29 | 2003-08-26 | Emc Corporation | System and method for cleaning a log structure |
US6507890B1 (en) | 2000-09-29 | 2003-01-14 | Emc Corporation | System and method for expanding a log structure in a disk array |
US6865650B1 (en) | 2000-09-29 | 2005-03-08 | Emc Corporation | System and method for hierarchical data storage |
US6654912B1 (en) * | 2000-10-04 | 2003-11-25 | Network Appliance, Inc. | Recovery of file system data in file servers mirrored file system volumes |
US6952797B1 (en) | 2000-10-25 | 2005-10-04 | Andy Kahn | Block-appended checksums |
US6810398B2 (en) | 2000-11-06 | 2004-10-26 | Avamar Technologies, Inc. | System and method for unorchestrated determination of data sequences using sticky byte factoring to determine breakpoints in digital sequences |
US6650601B1 (en) | 2001-04-26 | 2003-11-18 | International Business Machines Corporation | Hard disk drive picking device and method |
US6600703B1 (en) | 2001-04-26 | 2003-07-29 | International Business Machines Corporation | Magazine for a plurality of removable hard disk drives |
US6512962B2 (en) | 2001-04-26 | 2003-01-28 | International Business Machines Corporation | Cabling picker in a library of stationary memory devices |
US6941260B2 (en) * | 2001-04-26 | 2005-09-06 | International Business Machines Corporation | Method and apparatus for emulating a fiber channel port |
US6754768B2 (en) | 2001-04-26 | 2004-06-22 | International Business Machines Corporation | Library of hard disk drives with transparent emulating interface |
US6871263B2 (en) * | 2001-08-28 | 2005-03-22 | Sedna Patent Services, Llc | Method and apparatus for striping data onto a plurality of disk drives |
US6851082B1 (en) | 2001-11-13 | 2005-02-01 | Network Appliance, Inc. | Concentrated parity technique for handling double failures and enabling storage of more than one parity block per stripe on a storage device of a storage array |
US7346831B1 (en) | 2001-11-13 | 2008-03-18 | Network Appliance, Inc. | Parity assignment technique for parity declustering in a parity array of a storage system |
US6978283B1 (en) * | 2001-12-21 | 2005-12-20 | Network Appliance, Inc. | File system defragmentation technique via write allocation |
US7073115B2 (en) * | 2001-12-28 | 2006-07-04 | Network Appliance, Inc. | Correcting multiple block data loss in a storage array using a combination of a single diagonal parity group and multiple row parity groups |
US7613984B2 (en) * | 2001-12-28 | 2009-11-03 | Netapp, Inc. | System and method for symmetric triple parity for failing storage devices |
US7640484B2 (en) | 2001-12-28 | 2009-12-29 | Netapp, Inc. | Triple parity technique for enabling efficient recovery from triple failures in a storage array |
US6993701B2 (en) | 2001-12-28 | 2006-01-31 | Network Appliance, Inc. | Row-diagonal parity technique for enabling efficient recovery from double failures in a storage array |
US8402346B2 (en) * | 2001-12-28 | 2013-03-19 | Netapp, Inc. | N-way parity technique for enabling recovery from up to N storage device failures |
US7007220B2 (en) * | 2002-03-01 | 2006-02-28 | Broadlogic Network Technologies, Inc. | Error correction coding across multiple channels in content distribution systems |
US7080278B1 (en) | 2002-03-08 | 2006-07-18 | Network Appliance, Inc. | Technique for correcting multiple storage device failures in a storage array |
US6993539B2 (en) | 2002-03-19 | 2006-01-31 | Network Appliance, Inc. | System and method for determining changes in two snapshots and for transmitting changes to destination snapshot |
US7200715B2 (en) | 2002-03-21 | 2007-04-03 | Network Appliance, Inc. | Method for writing contiguous arrays of stripes in a RAID storage system using mapped block writes |
US7254813B2 (en) * | 2002-03-21 | 2007-08-07 | Network Appliance, Inc. | Method and apparatus for resource allocation in a raid system |
US7539991B2 (en) | 2002-03-21 | 2009-05-26 | Netapp, Inc. | Method and apparatus for decomposing I/O tasks in a raid system |
US7437727B2 (en) * | 2002-03-21 | 2008-10-14 | Network Appliance, Inc. | Method and apparatus for runtime resource deadlock avoidance in a raid system |
US6976146B1 (en) | 2002-05-21 | 2005-12-13 | Network Appliance, Inc. | System and method for emulating block appended checksums on storage devices by sector stealing |
US7024586B2 (en) * | 2002-06-24 | 2006-04-04 | Network Appliance, Inc. | Using file system information in raid data reconstruction and migration |
US7873700B2 (en) * | 2002-08-09 | 2011-01-18 | Netapp, Inc. | Multi-protocol storage appliance that provides integrated support for file and block access protocols |
US7340486B1 (en) * | 2002-10-10 | 2008-03-04 | Network Appliance, Inc. | System and method for file system snapshot of a virtual logical disk |
US7085953B1 (en) | 2002-11-01 | 2006-08-01 | International Business Machines Corporation | Method and means for tolerating multiple dependent or arbitrary double disk failures in a disk array |
US7809693B2 (en) * | 2003-02-10 | 2010-10-05 | Netapp, Inc. | System and method for restoring data on demand for instant volume restoration |
US7185144B2 (en) * | 2003-11-24 | 2007-02-27 | Network Appliance, Inc. | Semi-static distribution technique |
US7111147B1 (en) | 2003-03-21 | 2006-09-19 | Network Appliance, Inc. | Location-independent RAID group virtual block management |
US7328364B1 (en) | 2003-03-21 | 2008-02-05 | Network Appliance, Inc. | Technique for coherent suspension of I/O operations in a RAID subsystem |
US7143235B1 (en) | 2003-03-21 | 2006-11-28 | Network Appliance, Inc. | Proposed configuration management behaviors in a raid subsystem |
US7664913B2 (en) * | 2003-03-21 | 2010-02-16 | Netapp, Inc. | Query-based spares management technique |
US7424637B1 (en) | 2003-03-21 | 2008-09-09 | Networks Appliance, Inc. | Technique for managing addition of disks to a volume of a storage system |
US7171606B2 (en) * | 2003-03-25 | 2007-01-30 | Wegener Communications, Inc. | Software download control system, apparatus and method |
US7275179B1 (en) | 2003-04-24 | 2007-09-25 | Network Appliance, Inc. | System and method for reducing unrecoverable media errors in a disk subsystem |
US7437523B1 (en) | 2003-04-25 | 2008-10-14 | Network Appliance, Inc. | System and method for on-the-fly file folding in a replicated storage system |
US7174476B2 (en) * | 2003-04-28 | 2007-02-06 | Lsi Logic Corporation | Methods and structure for improved fault tolerance during initialization of a RAID logical unit |
US20040250028A1 (en) * | 2003-06-09 | 2004-12-09 | Daniels Rodger D. | Method and apparatus for data version checking |
US7206411B2 (en) | 2003-06-25 | 2007-04-17 | Wegener Communications, Inc. | Rapid decryption of data by key synchronization and indexing |
US7146461B1 (en) | 2003-07-01 | 2006-12-05 | Veritas Operating Corporation | Automated recovery from data corruption of data volumes in parity RAID storage systems |
US7047379B2 (en) * | 2003-07-11 | 2006-05-16 | International Business Machines Corporation | Autonomic link optimization through elimination of unnecessary transfers |
US7328305B2 (en) | 2003-11-03 | 2008-02-05 | Network Appliance, Inc. | Dynamic parity distribution technique |
US7783611B1 (en) | 2003-11-10 | 2010-08-24 | Netapp, Inc. | System and method for managing file metadata during consistency points |
US7401093B1 (en) | 2003-11-10 | 2008-07-15 | Network Appliance, Inc. | System and method for managing file data during consistency points |
US7721062B1 (en) | 2003-11-10 | 2010-05-18 | Netapp, Inc. | Method for detecting leaked buffer writes across file system consistency points |
US7428691B2 (en) * | 2003-11-12 | 2008-09-23 | Norman Ken Ouchi | Data recovery from multiple failed data blocks and storage units |
US7647451B1 (en) | 2003-11-24 | 2010-01-12 | Netapp, Inc. | Data placement technique for striping data containers across volumes of a storage system cluster |
US7263629B2 (en) * | 2003-11-24 | 2007-08-28 | Network Appliance, Inc. | Uniform and symmetric double failure correcting technique for protecting against two disk failures in a disk array |
US7366837B2 (en) * | 2003-11-24 | 2008-04-29 | Network Appliance, Inc. | Data placement technique for striping data containers across volumes of a storage system cluster |
JP5166735B2 (en) * | 2003-12-19 | 2013-03-21 | ネットアップ,インコーポレイテッド | System and method capable of synchronous data replication in a very short update interval |
US7478101B1 (en) | 2003-12-23 | 2009-01-13 | Networks Appliance, Inc. | System-independent data format in a mirrored storage system environment and method for using the same |
CA2552019A1 (en) * | 2003-12-29 | 2005-07-21 | Sherwood Information Partners, Inc. | System and method for reduced vibration interaction in a multiple-hard-disk-drive enclosure |
US8041888B2 (en) * | 2004-02-05 | 2011-10-18 | Netapp, Inc. | System and method for LUN cloning |
US7409494B2 (en) * | 2004-04-30 | 2008-08-05 | Network Appliance, Inc. | Extension of write anywhere file system layout |
US7409511B2 (en) * | 2004-04-30 | 2008-08-05 | Network Appliance, Inc. | Cloning technique for efficiently creating a copy of a volume in a storage system |
US7334094B2 (en) * | 2004-04-30 | 2008-02-19 | Network Appliance, Inc. | Online clone volume splitting technique |
US7334095B1 (en) | 2004-04-30 | 2008-02-19 | Network Appliance, Inc. | Writable clone of read-only volume |
US7430571B2 (en) * | 2004-04-30 | 2008-09-30 | Network Appliance, Inc. | Extension of write anywhere file layout write allocation |
US7519628B1 (en) | 2004-06-01 | 2009-04-14 | Network Appliance, Inc. | Technique for accelerating log replay with partial cache flush |
US7509329B1 (en) | 2004-06-01 | 2009-03-24 | Network Appliance, Inc. | Technique for accelerating file deletion by preloading indirect blocks |
US8726129B1 (en) * | 2004-07-23 | 2014-05-13 | Hewlett-Packard Development Company, L.P. | Methods of writing and recovering erasure coded data |
US20060075281A1 (en) * | 2004-09-27 | 2006-04-06 | Kimmel Jeffrey S | Use of application-level context information to detect corrupted data in a storage system |
US7243207B1 (en) | 2004-09-27 | 2007-07-10 | Network Appliance, Inc. | Technique for translating a pure virtual file system data stream into a hybrid virtual volume |
US7194595B1 (en) | 2004-09-27 | 2007-03-20 | Network Appliance, Inc. | Technique for translating a hybrid virtual volume file system into a pure virtual file system data stream |
US7260678B1 (en) | 2004-10-13 | 2007-08-21 | Network Appliance, Inc. | System and method for determining disk ownership model |
US7603532B2 (en) | 2004-10-15 | 2009-10-13 | Netapp, Inc. | System and method for reclaiming unused space from a thinly provisioned data container |
US7730277B1 (en) | 2004-10-25 | 2010-06-01 | Netapp, Inc. | System and method for using pvbn placeholders in a flexible volume of a storage system |
US7636744B1 (en) | 2004-11-17 | 2009-12-22 | Netapp, Inc. | System and method for flexible space reservations in a file system supporting persistent consistency point images |
US7523286B2 (en) * | 2004-11-19 | 2009-04-21 | Network Appliance, Inc. | System and method for real-time balancing of user workload across multiple storage systems with shared back end storage |
US7707165B1 (en) | 2004-12-09 | 2010-04-27 | Netapp, Inc. | System and method for managing data versions in a file system |
US7506111B1 (en) | 2004-12-20 | 2009-03-17 | Network Appliance, Inc. | System and method for determining a number of overwritten blocks between data containers
US8019842B1 (en) | 2005-01-27 | 2011-09-13 | Netapp, Inc. | System and method for distributing enclosure services data to coordinate shared storage |
US8180855B2 (en) * | 2005-01-27 | 2012-05-15 | Netapp, Inc. | Coordinated shared storage architecture |
US7424497B1 (en) | 2005-01-27 | 2008-09-09 | Network Appliance, Inc. | Technique for accelerating the creation of a point in time representation of a virtual file system
US7398460B1 (en) | 2005-01-31 | 2008-07-08 | Network Appliance, Inc. | Technique for efficiently organizing and distributing parity blocks among storage devices of a storage array |
US7574464B2 (en) * | 2005-02-14 | 2009-08-11 | Netapp, Inc. | System and method for enabling a storage system to support multiple volume formats simultaneously |
US7757056B1 (en) | 2005-03-16 | 2010-07-13 | Netapp, Inc. | System and method for efficiently calculating storage required to split a clone volume |
US8055702B2 (en) * | 2005-04-25 | 2011-11-08 | Netapp, Inc. | System and method for caching network file systems |
US7689609B2 (en) * | 2005-04-25 | 2010-03-30 | Netapp, Inc. | Architecture for supporting sparse volumes |
US7617370B2 (en) | 2005-04-29 | 2009-11-10 | Netapp, Inc. | Data allocation within a storage system architecture |
US7468117B2 (en) * | 2005-04-29 | 2008-12-23 | Kimberly-Clark Worldwide, Inc. | Method of transferring a wet tissue web to a three-dimensional fabric |
US7370261B2 (en) * | 2005-05-09 | 2008-05-06 | International Business Machines Corporation | Convolution-encoded raid with trellis-decode-rebuild |
US7401253B2 (en) * | 2005-05-09 | 2008-07-15 | International Business Machines Corporation | Convolution-encoded data storage on a redundant array of independent devices |
US7634760B1 (en) | 2005-05-23 | 2009-12-15 | Netapp, Inc. | System and method for remote execution of a debugging utility using a remote management module |
US7739318B2 (en) | 2005-06-20 | 2010-06-15 | Netapp, Inc. | System and method for maintaining mappings from data containers to their parent directories |
US7516285B1 (en) | 2005-07-22 | 2009-04-07 | Network Appliance, Inc. | Server side API for fencing cluster hosts via export access rights |
US7653682B2 (en) * | 2005-07-22 | 2010-01-26 | Netapp, Inc. | Client failure fencing mechanism for fencing network file system data in a host-cluster environment |
US7650366B1 (en) | 2005-09-09 | 2010-01-19 | Netapp, Inc. | System and method for generating a crash consistent persistent consistency point image set |
US20070088917A1 (en) * | 2005-10-14 | 2007-04-19 | Ranaweera Samantha L | System and method for creating and maintaining a logical serial attached SCSI communication channel among a plurality of storage systems |
US7467276B1 (en) | 2005-10-25 | 2008-12-16 | Network Appliance, Inc. | System and method for automatic root volume creation |
US7376796B2 (en) | 2005-11-01 | 2008-05-20 | Network Appliance, Inc. | Lightweight coherency control protocol for clustered storage system |
US7653829B2 (en) * | 2005-12-08 | 2010-01-26 | Electronics And Telecommunications Research Institute | Method of data placement and control in block-divided distributed parity disk array |
US7693864B1 (en) | 2006-01-03 | 2010-04-06 | Netapp, Inc. | System and method for quickly determining changed metadata using persistent consistency point image differencing |
US7734603B1 (en) | 2006-01-26 | 2010-06-08 | Netapp, Inc. | Content addressable storage array element |
US8560503B1 (en) | 2006-01-26 | 2013-10-15 | Netapp, Inc. | Content addressable storage system |
US8285817B1 (en) | 2006-03-20 | 2012-10-09 | Netapp, Inc. | Migration engine for use in a logical namespace of a storage system environment |
US7590660B1 (en) | 2006-03-21 | 2009-09-15 | Network Appliance, Inc. | Method and system for efficient database cloning |
US8260831B2 (en) * | 2006-03-31 | 2012-09-04 | Netapp, Inc. | System and method for implementing a flexible storage manager with threshold control |
US20070233868A1 (en) * | 2006-03-31 | 2007-10-04 | Tyrrell John C | System and method for intelligent provisioning of storage across a plurality of storage systems |
US7769723B2 (en) * | 2006-04-28 | 2010-08-03 | Netapp, Inc. | System and method for providing continuous data protection |
US8051043B2 (en) | 2006-05-05 | 2011-11-01 | Hybir Inc. | Group based complete and incremental computer file backup system, process and apparatus |
US7822921B2 (en) | 2006-10-31 | 2010-10-26 | Netapp, Inc. | System and method for optimizing write operations in storage systems |
US7613947B1 (en) | 2006-11-30 | 2009-11-03 | Netapp, Inc. | System and method for storage takeover |
WO2008070814A2 (en) * | 2006-12-06 | 2008-06-12 | Fusion Multisystems, Inc. (Dba Fusion-Io) | Apparatus, system, and method for a scalable, composite, reconfigurable backplane |
US7647526B1 (en) | 2006-12-06 | 2010-01-12 | Netapp, Inc. | Reducing reconstruct input/output operations in storage systems |
US8301673B2 (en) * | 2006-12-29 | 2012-10-30 | Netapp, Inc. | System and method for performing distributed consistency verification of a clustered file system |
US8219821B2 (en) | 2007-03-27 | 2012-07-10 | Netapp, Inc. | System and method for signature based data container recognition |
US8312214B1 (en) | 2007-03-28 | 2012-11-13 | Netapp, Inc. | System and method for pausing disk drives in an aggregate |
US8209587B1 (en) | 2007-04-12 | 2012-06-26 | Netapp, Inc. | System and method for eliminating zeroing of disk drives in RAID arrays |
US8898536B2 (en) * | 2007-04-27 | 2014-11-25 | Netapp, Inc. | Multi-core engine for detecting bit errors |
US8219749B2 (en) * | 2007-04-27 | 2012-07-10 | Netapp, Inc. | System and method for efficient updates of sequential block storage |
US7882304B2 (en) * | 2007-04-27 | 2011-02-01 | Netapp, Inc. | System and method for efficient updates of sequential block storage |
US7840837B2 (en) * | 2007-04-27 | 2010-11-23 | Netapp, Inc. | System and method for protecting memory during system initialization |
US7827350B1 (en) | 2007-04-27 | 2010-11-02 | Netapp, Inc. | Method and system for promoting a snapshot in a distributed file system |
US7752489B2 (en) | 2007-05-10 | 2010-07-06 | International Business Machines Corporation | Data integrity validation in storage systems |
US7836331B1 (en) | 2007-05-15 | 2010-11-16 | Netapp, Inc. | System and method for protecting the contents of memory during error conditions |
US7975102B1 (en) | 2007-08-06 | 2011-07-05 | Netapp, Inc. | Technique to avoid cascaded hot spotting |
US7873878B2 (en) * | 2007-09-24 | 2011-01-18 | International Business Machines Corporation | Data integrity validation in storage systems |
US7873803B2 (en) | 2007-09-25 | 2011-01-18 | Sandisk Corporation | Nonvolatile memory with self recovery |
US7996636B1 (en) | 2007-11-06 | 2011-08-09 | Netapp, Inc. | Uniquely identifying block context signatures in a storage volume hierarchy |
US7984259B1 (en) | 2007-12-17 | 2011-07-19 | Netapp, Inc. | Reducing load imbalance in a storage system |
US8380674B1 (en) | 2008-01-09 | 2013-02-19 | Netapp, Inc. | System and method for migrating lun data between data containers |
US8621154B1 (en) | 2008-04-18 | 2013-12-31 | Netapp, Inc. | Flow based reply cache |
US8725986B1 (en) | 2008-04-18 | 2014-05-13 | Netapp, Inc. | System and method for volume block number to disk block number mapping |
US8161236B1 (en) | 2008-04-23 | 2012-04-17 | Netapp, Inc. | Persistent reply cache integrated with file system |
US8006128B2 (en) * | 2008-07-31 | 2011-08-23 | Datadirect Networks, Inc. | Prioritized rebuilding of a storage device |
WO2010049928A1 (en) * | 2008-10-27 | 2010-05-06 | Kaminario Technologies Ltd. | System and methods for RAID writing and asynchronous parity computation
US9158579B1 (en) | 2008-11-10 | 2015-10-13 | Netapp, Inc. | System having operation queues corresponding to operation execution time |
WO2010057186A1 (en) * | 2008-11-17 | 2010-05-20 | Unisys Corporation | Data recovery using error strip identifiers
US8392682B2 (en) | 2008-12-17 | 2013-03-05 | Unisys Corporation | Storage security using cryptographic splitting
US8135980B2 (en) | 2008-12-23 | 2012-03-13 | Unisys Corporation | Storage availability using cryptographic splitting
US8386798B2 (en) | 2008-12-23 | 2013-02-26 | Unisys Corporation | Block-level data storage using an outstanding write list
US8495417B2 (en) * | 2009-01-09 | 2013-07-23 | Netapp, Inc. | System and method for redundancy-protected aggregates |
US8171227B1 (en) | 2009-03-11 | 2012-05-01 | Netapp, Inc. | System and method for managing a flow based reply cache |
US8433685B2 (en) * | 2010-08-18 | 2013-04-30 | Hewlett-Packard Development Company, L.P. | Method and system for parity-page distribution among nodes of a multi-node data-storage system |
US8849877B2 (en) | 2010-08-31 | 2014-09-30 | Datadirect Networks, Inc. | Object file system |
US8572441B2 (en) * | 2011-08-05 | 2013-10-29 | Oracle International Corporation | Maximizing encodings of version control bits for memory corruption detection |
US8756582B2 (en) * | 2011-08-22 | 2014-06-17 | International Business Machines Corporation | Tracking a program's calling context using a hybrid code signature
US20130198585A1 (en) * | 2012-02-01 | 2013-08-01 | Xyratex Technology Limited | Method of, and apparatus for, improved data integrity |
US8874956B2 (en) | 2012-09-18 | 2014-10-28 | Datadirect Networks, Inc. | Data re-protection in a distributed replicated data storage system |
US9043559B2 (en) | 2012-10-23 | 2015-05-26 | Oracle International Corporation | Block memory engine with memory corruption detection |
US9367394B2 (en) | 2012-12-07 | 2016-06-14 | Netapp, Inc. | Decoupled reliability groups |
US8843447B2 (en) | 2012-12-14 | 2014-09-23 | Datadirect Networks, Inc. | Resilient distributed replicated data storage system |
US9020893B2 (en) | 2013-03-01 | 2015-04-28 | Datadirect Networks, Inc. | Asynchronous namespace maintenance |
US10482009B1 (en) * | 2013-03-15 | 2019-11-19 | Google Llc | Use of a logical-to-logical translation map and a logical-to-physical translation map to access a data storage device |
US9619499B2 (en) | 2013-08-07 | 2017-04-11 | International Business Machines Corporation | Hardware implementation of a tournament tree sort algorithm |
US9830354B2 (en) | 2013-08-07 | 2017-11-28 | International Business Machines Corporation | Accelerating multiple query processing operations |
US9672298B2 (en) | 2014-05-01 | 2017-06-06 | Oracle International Corporation | Precise execution of versioned store instructions
US9563509B2 (en) | 2014-07-15 | 2017-02-07 | Nimble Storage, Inc. | Methods and systems for storing data in a redundant manner on a plurality of storage units of a storage system |
US9195593B1 (en) | 2014-09-27 | 2015-11-24 | Oracle International Corporation | Hardware assisted object memory migration |
US10310813B2 (en) | 2014-12-29 | 2019-06-04 | International Business Machines Corporation | Hardware implementation of a tournament tree sort algorithm using an external memory |
CN107748702B (en) | 2015-06-04 | 2021-05-04 | 华为技术有限公司 | Data recovery method and device |
US10528546B1 (en) | 2015-09-11 | 2020-01-07 | Cohesity, Inc. | File system consistency in a distributed system using version vectors |
US11016848B2 (en) | 2017-11-02 | 2021-05-25 | Seagate Technology Llc | Distributed data storage system with initialization-less parity |
US11593237B2 (en) | 2021-05-28 | 2023-02-28 | International Business Machines Corporation | Fast recovery with enhanced raid protection |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3387261A (en) * | 1965-02-05 | 1968-06-04 | Honeywell Inc | Circuit arrangement for detection and correction of errors occurring in the transmission of digital data |
US4092732A (en) * | 1977-05-31 | 1978-05-30 | International Business Machines Corporation | System for recovering data stored in failed memory unit |
NL7804674A (en) * | 1978-05-02 | 1979-11-06 | Philips Nv | MEMORY WITH ERROR DETECTION AND CORRECTION. |
US4433388A (en) * | 1980-10-06 | 1984-02-21 | Ncr Corporation | Longitudinal parity |
DE3379192D1 (en) * | 1983-12-19 | 1989-03-16 | Itt Ind Gmbh Deutsche | Correction method for symbol errors in video/teletext signals |
US4842262A (en) * | 1984-02-22 | 1989-06-27 | Delphax Systems | Document inverter |
1986
- 1986-06-12 US US06873249 patent/US4761785B1/en not_active Expired - Lifetime

1987
- 1987-04-27 CA CA000535598A patent/CA1270333A/en not_active Expired - Lifetime
- 1987-05-08 JP JP62110902A patent/JPS62293355A/en active Granted
- 1987-05-26 DE DE3750790T patent/DE3750790T2/en not_active Expired - Lifetime
- 1987-05-26 EP EP87107666A patent/EP0249091B1/en not_active Expired - Lifetime
Also Published As
Publication number | Publication date |
---|---|
JPS62293355A (en) | 1987-12-19 |
US4761785B1 (en) | 1996-03-12 |
DE3750790D1 (en) | 1995-01-12 |
US4761785A (en) | 1988-08-02 |
EP0249091B1 (en) | 1994-11-30 |
JPH0547857B2 (en) | 1993-07-19 |
DE3750790T2 (en) | 1995-05-24 |
EP0249091A2 (en) | 1987-12-16 |
EP0249091A3 (en) | 1990-04-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA1270333A (en) | | Parity spreading to enhance storge access |
US5881311A (en) | | Data storage subsystem with block based data management |
US10210045B1 (en) | | Reducing concurrency bottlenecks while rebuilding a failed drive in a data storage system |
US9696914B2 (en) | | System and method for transposed storage in RAID arrays |
US5809516A (en) | | Allocation method of physical regions of a disc array to a plurality of logically-sequential data, adapted for increased parallel access to data |
US5404361A (en) | | Method and apparatus for ensuring data integrity in a dynamically mapped data storage subsystem |
CA1321845C (en) | | File system for a plurality of storage classes |
US5574882A (en) | | System and method for identifying inconsistent parity in an array of storage |
US5708668A (en) | | Method and apparatus for operating an array of storage devices |
EP0485110B1 (en) | | Logical partitioning of a redundant array storage system |
US6138125A (en) | | Block coding method and system for failure recovery in disk arrays |
US7155569B2 (en) | | Method for RAID striped I/O request generation using a shared scatter gather list |
US4972316A (en) | | Method of handling disk sector errors in DASD cache |
US5632012A (en) | | Disk scrubbing system |
US5564116A (en) | | Array type storage unit system |
US6070254A (en) | | Advanced method for checking the integrity of node-based file systems |
US7111227B2 (en) | | Methods and systems of using result buffers in parity operations |
JP2000511318A (en) | | Transformable RAID for Hierarchical Storage Management System |
EP0572564A4 (en) | | |
JPH04230512A (en) | | Method and apparatus for updating records for a DASD array |
JPH04278641A (en) | | Data memory system and method |
US6427212B1 (en) | | Data fault tolerance software apparatus and method |
US6766480B2 (en) | | Using task description blocks to maintain information regarding operations |
US7346733B2 (en) | | Storage apparatus, system and method using a plurality of object-based storage devices |
Varma et al. | | Destage algorithms for disk arrays with non-volatile caches |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | MKEX | Expiry | 