CN1902620B - Virtual file system - Google Patents


Info

Publication number: CN1902620B
Application number: CN2004800398047A
Authority: CN (China)
Other versions: CN1902620A
Original language: Chinese (zh)
Legal status: Expired - Fee Related
Inventors: 史蒂文·W·罗斯, 尼尔·A·罗茨, 科里纳·G·阿布杜尔
Original assignee: PANGRAC AND ASSOCIATES DEV Inc
Current assignees: PANGRAC AND ASSOCIATES DEV Inc; INTERACTIVE CONTENT ENGINES LLC
Priority claimed from US 10/999,286 (US7644136B2)
Application filed by PANGRAC AND ASSOCIATES DEV Inc
Publication of application CN1902620A; application granted and published as CN1902620B
Abstract

A virtual file system (209) including multiple storage processor nodes (103) including a management node (205), a backbone switch (101), a disk drive array (111), and a virtual file manager (301) executing on the management node. The backbone switch enables communication between the storage processor nodes. The disk drive array is coupled to and distributed across the storage processor nodes and stores multiple titles. Each title is divided into data subchunks (113a-113e) which are distributed across the disk drive array, with each subchunk stored on a disk drive of the array. The virtual file manager manages storage and access of each subchunk, and maintains multiple directory entries including a directory entry for each title. Each directory entry is a list of subchunk location entries, where each subchunk location entry includes a storage processor node identifier, a disk drive identifier, and a logical address for locating and accessing each subchunk of each title.

Description

Virtual File System
Technical field
The present invention relates to interactive broadband server systems, and more particularly to a virtual file system that manages and maintains data distributed across an array of storage devices.
Background art
It has long been desirable to provide a solution for storing and delivering streaming media content. Although different data rates have been considered, an initial scalability target is 100 to 1,000,000 simultaneous independent synchronous content streams delivered at 4 megabits per second (Mbps) per stream. The total available bandwidth is limited by the largest available backplane switch, which at present is in the terabit-per-second range, or roughly 200,000 simultaneous output streams. In general, the number of output streams is inversely proportional to the per-stream bit rate.
The simplest model of content storage is a single disk drive connected to a single processor with a single network connector. Data is read from the disk, placed in memory, and distributed in packets over a network to the user. Traditional data, such as web pages, is delivered asynchronously; in other words, there are arbitrary amounts of data with arbitrary delays. Low-volume, low-resolution video can be delivered from a web server in this way. Real-time media content, such as video and audio, requires synchronous transmission, or transmission with guaranteed delivery times. Here the bandwidth constraint is the disk drive itself, which must contend with arm movement and rotational latency. If the system can support only six simultaneous continuous content streams from the drive to the processor at a given time, then a seventh user's request must wait until one of the six prior users relinquishes its content stream. The upside of this design is simplicity; the downside is the disk which, as the only mechanical device in the design, can only access and transfer data so fast.
An improvement is to add another drive, or several drives, and interleave the drive accesses. Duplicate content may also be stored on each drive to gain redundancy and performance. This is better, but a number of problems remain. Only so much content can be placed on the local drive or drives. The disk drives, CPU, and memory are each single points of failure whose consequences could be catastrophic. The system can only scale to the number of drives that the disk controller can handle. Even with many devices there is the problem of title distribution. In the real world, everyone wants to see the latest movies; as a rule of thumb, 80% of content requests are for only 20% of the titles. A single title cannot be allowed to consume all of a machine's bandwidth, because that would block access to the less popular titles stored only on that machine. As a result, "high-demand" titles must be loaded on most or all of the machines. In short, a user who wants to watch an old movie may be out of luck, even though the movie is loaded in the system. In a large library, the 80/20 effect is even more pronounced than in this example.
Such a system is also inefficient if it handles data over a standard local area network (LAN). Modern Ethernet-based TCP/IP systems achieve impressive delivery guarantees, but they do so with packet collisions and occasional lost packets whose retransmission time costs must be managed to keep them running. There is no guarantee that a set of content streams will arrive in a timely manner. Furthermore, each user consumes a switch port and each content server consumes a switch port, so the switch port count is twice the server count, which limits the overall on-line bandwidth.
Summary of the invention
The present invention is intended to address the technical problems described above.
According to one aspect of the present invention, a virtual file system is provided, comprising: a plurality of storage processor nodes, including at least one management node, each storage processor node including a port interface and a disk drive interface; a backbone switch including a plurality of ports, each port coupled to a corresponding port interface of the plurality of storage processor nodes, the backbone switch enabling communication among the storage processor nodes; a disk drive array coupled to and distributed across the disk drive interfaces of the plurality of storage processor nodes, the disk drive array storing a plurality of titles, each title divided into a plurality of subchunks distributed across the disk drive array, each subchunk being stored on a disk drive of the disk drive array; a virtual file manager executing on the at least one management node, the virtual file manager managing storage and access of each subchunk of the plurality of titles and maintaining a plurality of directory entries including a directory entry for each title, each directory entry comprising a list of subchunk location entries, each subchunk location entry including a storage processor node identifier, a disk drive identifier, and a logical address for locating and accessing each subchunk of each title stored on the disk drive array; and a user process executing on a storage processor node, the user process submitting a title request for a selected title to the virtual file manager, receiving from the virtual file manager the corresponding directory entry for the selected title, submitting a subchunk read request for each subchunk location entry in the corresponding directory entry, sending each subchunk read request to, and receiving the subchunk from, the storage processor node identified by the storage processor node identifier in the corresponding subchunk location entry, and rebuilding the selected title from the received subchunks.
According to a further aspect of the present invention, a virtual file system is provided, comprising: a plurality of storage processor nodes, including at least one management node, each storage processor node including a port interface and a disk drive interface; a backbone switch including a plurality of ports, each port coupled to a corresponding port interface of the plurality of storage processor nodes, the backbone switch enabling communication among the storage processor nodes; a disk drive array coupled to and distributed across the disk drive interfaces of the plurality of storage processor nodes, the disk drive array storing a plurality of titles, each title divided into a plurality of subchunks distributed across the disk drive array, each subchunk being stored on a disk drive of the disk drive array; and a virtual file manager executing on the at least one management node, the virtual file manager managing storage and access of each subchunk of the plurality of titles and maintaining a plurality of directory entries including a directory entry for each title, each directory entry comprising a list of subchunk location entries, each subchunk location entry including a storage processor node identifier, a disk drive identifier, and a logical address for locating and accessing each subchunk of each title stored on the disk drive array; wherein the virtual file manager manages storage of the titles, wherein each title is divided into a plurality of data chunks, each data chunk comprising a plurality of subchunks that include redundant data for the data chunk, and wherein the disk drive array is divided into a plurality of redundant array groups, each redundant array group comprising a plurality of disk drives distributed across the plurality of storage processor nodes, the plurality of subchunks of each data chunk being distributed across the disk drives of a corresponding redundant array group.
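For illustration only, the directory-entry and subchunk-location structures recited above can be sketched as follows (Python, with illustrative field names; a simplified model under stated assumptions, not the disclosed implementation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubchunkLocation:
    """One subchunk location entry: where a single subchunk lives."""
    spn_id: int    # storage processor node identifier
    drive_id: int  # disk drive identifier within that node
    lba: int       # logical address of the subchunk on that drive

@dataclass
class DirectoryEntry:
    """Directory entry for one title: an ordered list of subchunk locations."""
    title_id: str
    locations: List[SubchunkLocation]

def subchunk_read_requests(entry: DirectoryEntry):
    """A user process walks the directory entry and issues one read request per
    subchunk, addressed to the node identified in the corresponding location."""
    for seq, loc in enumerate(entry.locations):
        yield {"seq": seq, "node": loc.spn_id, "drive": loc.drive_id, "lba": loc.lba}
```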
Description of drawings
The benefits, features, and advantages of the present invention will be better understood from the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a simplified block diagram of a portion of an interactive content engine (ICE) implemented according to an exemplary embodiment of the present invention;
FIG. 2 is a partial logic block diagram of the ICE of FIG. 1, illustrating a synchronized data transfer system;
FIG. 3 is a partial block diagram of the ICE of FIG. 1, further illustrating details of the VFS of FIG. 2 and supporting functionality according to an embodiment of the present invention;
FIG. 4 shows Table 1, illustrating an exemplary configuration of the ICE of FIG. 1 that includes only three disk array groups;
FIG. 5 shows Table 2, illustrating how four titles are stored using the configuration of Table 1;
FIG. 6 shows Table 3, illustrating the contents of the first twelve locators of the four titles described in Table 2; and
FIG. 7 shows Table 4, further illustrating details of how subchunks are stored on the different groups, SPNs, and disk drives of the ICE of FIG. 1.
Detailed description
The following description is presented to enable one of ordinary skill in the art to make and use the present invention as provided in the context of a particular application and its requirements. Various modifications to the preferred embodiment will, however, be apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The architecture described herein accommodates individual components of varying capability, so that an installation is not limited to the technology available when the initial system is purchased. The use of commodity components guarantees recent, proven technology, avoids sole-source suppliers, and achieves the lowest cost per stream. Failures of individual components are tolerated. In many cases there is no noticeable change from the user's perspective; in other cases there is a brief "self-healing" cycle. As a general rule, a number of faults can be tolerated, and in most if not all cases the system can recover without requiring immediate attention, making it suitable for "lights-out" operation.
Content storage allocation and internal bandwidth are managed automatically by least-recently-used (LRU) algorithms, which ensure that the contents of the RAM cache and the hard-drive-array cache match current demand, and that the backplane switch bandwidth is used in the most efficient manner. The system bandwidth is rarely, if ever, oversubscribed, so packets need not be dropped or delayed. The architecture therefore exploits the aggregate bandwidth of its components so that guarantees can be met, while the network remains private, fully under control, and free of data-path overload even under unanticipated peak demand. Streams of any bit rate can be accommodated, though typical streams are expected to lie in the range of 1 to 20 Mbps. Asynchronous content is provided on a bandwidth-available basis, and bandwidth may be reserved where an application requires it. Files may be of any size with minimal loss of storage efficiency.
FIG. 1 is a simplified block diagram of a portion of an interactive content engine (ICE) 100 implemented according to an exemplary embodiment of the present invention. Portions not applicable to a full understanding of the present invention are not shown for purposes of clarity. The ICE 100 includes an appropriate multi-port Gigabit Ethernet (GbE) switch 101 as the backbone fabric, with multiple Ethernet ports coupled to a number of storage processor nodes (SPNs) 103. Each SPN 103 is a simplified server including two Gigabit Ethernet ports, one or more processors 107, memory 109 (e.g., random access memory, RAM), and an appropriate number (e.g., four to eight) of disk drives 111. A Gb port 105 on each SPN 103 connects to a corresponding port of the switch 101, operates in full duplex (simultaneously transmitting and receiving on each SPN/port connection), and is used to move data within the ICE 100. The other Gb port (not shown) delivers the content output downstream to the users (not shown).
Each SPN 103 has high-speed access to its local disk drives and to the disk drives of the other four SPNs in its group of five. The switch 101 is the backbone of the ICE 100 rather than just a communication device between the SPNs 103. Only five SPNs 103 are shown for purposes of illustration, although the ICE 100 typically includes a large number of servers. Each SPN 103 is used for storage, processing, and transmission of content. In the configuration shown, each SPN 103 is constructed from off-the-shelf components and is not a computer in the usual sense. Although standard operating systems were contemplated, such interrupt-driven operating systems would create unnecessary bottlenecks.
Each title (e.g., a video, movie, or other media content) is not stored entirely on any single disk drive 111. Instead, the data of each title is divided up and stored among several disk drives within the ICE 100 to realize the speed benefits of interleaved access. The content of a single title is spread across multiple disk drives of multiple SPNs 103. Short "time frames" of title content are gathered in round-robin fashion from each drive of each SPN 103. Spreading the physical load in this way escapes the drive-count limitations of SCSI and IDE, provides a form of fail-safe operation, and allows a very large set of titles to be organized and managed.
In the particular configuration shown, each content title is divided into discrete chunks of a fixed size (typically about 2 megabytes (MB) per chunk). Each chunk is stored on a different group of SPNs 103 in round-robin fashion. Each chunk is divided into four subchunks, and a fifth subchunk representing parity is created. Each subchunk is stored on a disk drive of a different SPN 103. In the configuration shown and described, the subchunk size of approximately 512 kilobytes (KB) ("K" being 1,024) matches the nominal data unit of each disk drive 111. The SPNs 103 are grouped five at a time, and each group or SPN set stores one chunk of a given title. As shown, the five SPNs 103 are labeled 1-4 and "Parity", and collectively store a chunk 113: SPNs 1, 2, 3, 4, and Parity store subchunks 113a, 113b, 113c, 113d, and 113e, respectively. The subchunks 113a-113e are shown stored in a distributed manner on a different drive of each SPN (e.g., SPN1/DRIVE1, SPN2/DRIVE2, SPN3/DRIVE3, etc.), but they may be stored in any other possible combination (e.g., SPN1/DRIVE1, SPN2/DRIVE1, SPN3/DRIVE3, etc.). Subchunks 1-4 contain the data, and the parity subchunk contains the parity information for the data subchunks. The SPN group size, although typically five, may be any other suitable number, for example from two SPNs to ten SPNs. Two SPNs would devote 50% of their storage to redundancy, while ten would use only 10%; five is a compromise between storage efficiency and the probability of failure.
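As a simplified illustration of the chunk-to-subchunk split just described (assuming conventional XOR parity, which the description does not spell out but which allows any one missing subchunk to be regenerated), the following sketch divides a 2 MB chunk into four 512 KB data subchunks and computes a fifth parity subchunk:

```python
CHUNK_SIZE = 2 * 1024 * 1024                      # ~2 MB chunk, as described above
DATA_SUBCHUNKS = 4                                # four data subchunks + one parity
SUBCHUNK_SIZE = CHUNK_SIZE // DATA_SUBCHUNKS      # ~512 KB

def split_chunk(chunk: bytes):
    """Split one chunk into four data subchunks plus an XOR parity subchunk.
    XOR parity (an assumption here) lets any one missing subchunk be rebuilt."""
    assert len(chunk) == CHUNK_SIZE
    data = [chunk[i * SUBCHUNK_SIZE:(i + 1) * SUBCHUNK_SIZE]
            for i in range(DATA_SUBCHUNKS)]
    parity = bytearray(SUBCHUNK_SIZE)
    for sub in data:
        for i, b in enumerate(sub):
            parity[i] ^= b
    return data, bytes(parity)
```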
Distributing the content in this way accomplishes at least two goals. First, the number of users who can view a single title is not limited to the number that can be served by a single group of SPNs, but is instead limited only by the combined bandwidth of all SPN groups together. Only one copy of each content title is therefore required. The trade-off is a limit on the number of new viewers of a given title that can be started each second, which is far less costly than the wasted space and redundant-storage management overhead of multiple copies. The second goal is an increase in the overall reliability of the ICE 100. The failure of a single drive is masked by regenerating its content in real time using the parity drive, similar to a redundant array of independent disks (RAID). The failure of an SPN 103 is masked by the fact that it contains one drive from each of several RAID groups, each of which continues to operate. Users connected to a failed SPN are quickly taken over by shadow processes running on other SPNs. When a disk drive or an entire SPN fails, the operator is notified to repair or replace the failed equipment. When a missing subchunk is rebuilt by a user process, it is passed back to the SPN that should have provided it, where it is cached in RAM (just as if it had been read from the local disk drive). This avoids wasting the time of other user processes on repeating the same repair for a popular title, because subsequent requests are satisfied from the subchunk in RAM, as long as the subchunk is popular enough to remain cached.
The goal of a user process (UP) running on each "user" SPN 103 is to gather the subchunk from its own disk plus the corresponding four subchunks from the other user SPNs, assemble a chunk of video content, and deliver it. The user SPNs are distinguished from one or more management (MGMT) SPNs, which are configured in the same way but perform different functions, as described below. A pair of redundant MGMT SPNs is contemplated to improve reliability and performance. The gathering and assembling functions performed by each UP are repeated for many users on each user SPN 103, so there is a significant amount of data traffic among the user SPNs 103. The typical Ethernet protocol, with its packet-collision detection and retries, would be overwhelmed. Typical protocols are designed for random transmissions and rely on idle time between those events, so that approach is not used here. In the ICE 100, collisions are avoided by using a fully switched, full-duplex fabric and by carefully managing bandwidth. Most communication is performed synchronously. The switch 101 is itself managed in a synchronous manner, as further described below, so that the transmissions are coordinated. Because it is determined which SPNs 103 transmit and when, ports are not overwhelmed with more data than they can handle in a given period. In fact, the data is first gathered in the memory 109 of a user SPN 103 and its transfer is then controlled synchronously. As part of the coordination, there are presence signals between the user SPNs 103; unlike the actual content sent to the end users, the data in these inter-SPN command transfers is very small.
If subchunks were allowed to be transmitted randomly or asynchronously, the length of each subchunk (about 512 K bytes, where "K" is 1,024) would overwhelm any buffering available in the GbE switch 101. Transmitting that much information takes about 4 milliseconds (ms), and it is desirable to ensure that multiple ports do not attempt to transmit to a single port at the same time. Therefore, as further described below, the switch 101 is operated in a manner that results in synchronous operation, with all ports fully utilized under full load.
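The 4 ms figure quoted above follows directly from the subchunk size and the 1 Gbps per-port rate (ignoring packet overhead):

$$ t \approx \frac{512 \times 1024 \times 8\ \text{bits}}{1 \times 10^{9}\ \text{bits/s}} \approx 4.2\ \text{ms}. $$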
A redundant directory process that manages the file system (or virtual file system, VFS) is responsible for reporting where a given content title is stored when it is requested by a user, and for allocating the required storage space when a new title is loaded. All allocations are in whole chunks, each of which consists of five subchunks. Space on each disk drive is managed within the drive by logical block address (LBA). Subchunks are stored in contiguous sectors or LBAs on a disk drive. The capacity of each disk drive in the ICE 100 is represented by its maximum LBA divided by the number of sectors per subchunk.
Each title map, or "directory entry", contains a list indicating where the chunks of the title are stored, and more particularly where each subchunk of each chunk is located. In the illustrated embodiment, each item of the list representing one subchunk contains an SPNID identifying a specific user SPN 103, a disk drive number (DD#) identifying a specific disk drive 111 of the identified user SPN, and a subchunk pointer (or logical block address, LBA) packed into a 64-bit value. Each directory entry contains the subchunk list for about half an hour of content at the nominal 4 Mbps, which amounts to 450 chunks, or 2,250 subchunks. Each directory entry is about 20 KB including ancillary data. When a UP running on an SPN requests a directory entry, the entire entry is sent and stored locally for the corresponding user. Even if an SPN supports 1,000 users, the local lists or directory entries consume only 20 MB of memory.
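These figures are mutually consistent, as a quick check using the 2 MB chunk size and the 8-byte locator described elsewhere in this description shows:

$$ \frac{4 \times 10^{6}\ \text{bits/s} \times 1800\ \text{s}}{8\ \text{bits/byte} \times 2\ \text{MB/chunk}} = 450\ \text{chunks}, \qquad 450 \times 5 = 2250\ \text{subchunks}, $$

$$ 2250 \times 8\ \text{bytes} = 18\ \text{KB} \ (\approx 20\ \text{KB with ancillary data}), \qquad 1000 \times 20\ \text{KB} = 20\ \text{MB}. $$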
The ICE 100 maintains a database of all titles available to a user. The list includes titles on the local optical disk library, real-time network programming, and titles at remote locations where licensing and transport arrangements are in place. The database contains all of the metadata for each title, including management information (licensing period, bit rate, resolution, etc.) and information of interest to the user (producer, director, cast, crew, author, etc.). When the user makes a selection, the directory of a virtual file system (VFS) 209 (FIG. 2) is queried to determine whether the title is already loaded on the disk array. If not, a loading process (not shown) is started for the content, and the UP is notified, if necessary, of when the content will be available for viewing. In most cases the delay is no more than the mechanical latency of the optical-disk retrieval robot (not shown), or about 30 seconds.
The information stored on the optical disk (not shown) includes all of the metadata (which is read into the database the first time the disk is loaded into the library), the compressed digital video and audio representing the title, and all of the information about those data streams that can be gathered in advance.
Included in the resource management system is a scheduler (not shown), which a UP consults to receive a start time for its stream (usually within a few milliseconds of the request). The scheduler ensures that the load on the system remains even, that latency is minimized, and that the bandwidth required within the ICE 100 never exceeds what is available. When a user requests a stop, pause, fast forward, rewind, or other operation that interrupts the flow of its stream, its bandwidth is deallocated and a new allocation is made for any new service requested (e.g., a fast-forward stream).
FIG. 2 is a logical block diagram of a portion of the ICE 100 illustrating a synchronized data transfer system 200 implemented according to an embodiment of the present invention. The switch 101 is shown coupled to several exemplary SPNs 103, including a first user SPN 201, a second user SPN 203, and a management (MGMT) SPN 205. As previously noted, many SPNs 103 are coupled to the switch 101; only two user SPNs 201, 203 are shown to illustrate the present invention, and each is physically implemented just like any SPN 103 previously described. The MGMT SPN 205 is likewise physically implemented like any other SPN 103, but generally performs management functions rather than the specific user functions. SPN 201 illustrates certain functions and SPN 203 illustrates other functions of each user SPN 103. It is understood, however, that each user SPN 103 is configured to perform similar functions, so that the functions (and processes) described for SPN 201 are also provided on SPN 203, and vice versa.
As previously described, the switch 101 operates at 1 Gbps per port, so that each subchunk (approximately 512 KB) takes about 4 ms to pass from one SPN to another. Each user SPN 103 executes one or more user processes (UPs), each supporting a single downstream user. When a new chunk of a title is needed to refill a user's output buffer (not shown), the next five subchunks on the list are requested from the other user SPNs storing them. Since many UPs may request multiple subchunks at essentially the same time, the transmission duration of the subchunks alone could flood the buffering capacity of almost any GbE switch for a single port, let alone for the whole switch. That is true of the illustrated switch 101. If subchunk transmission were not managed, all five subchunks for every UP might be returned simultaneously, overwhelming the output-port bandwidth. It is desirable to tighten the timing of SPN transmissions within the ICE 100 so that the most critical data is transmitted first and arrives intact.
SPN 201 is shown executing a UP 207 to serve a corresponding downstream user. The user requests a title (e.g., a movie), and the request is forwarded to the UP 207. The UP 207 sends a title request (TR) to the VFS 209 located on the MGMT SPN 205 (described further below). The VFS 209 returns a directory entry (DE) to the UP 207, which stores the DE locally, shown at 211. The DE 211 includes a list locating each subchunk of the title (SC1, SC2, etc.), each entry including the SPNID identifying a specific user SPN 103, the disk drive number (DD#) identifying a specific disk drive 111 of the identified SPN 103, and the address or LBA giving the specific location of the subchunk on the identified disk drive. SPN 201 initiates a timestamped read request (TSRR), one at a time, for each subchunk in the DE 211. In the ICE 100 the requests are made immediately and directly: that is, SPN 201 issues the request for a subchunk immediately and directly to the specific user SPN 103 storing the data. In the configuration shown, even locally stored subchunks are requested in the same manner; even if the requested subchunk resides on a local disk drive of SPN 201, the request is sent through the switch 101 as though it were remote. The network is configured to detect that a request from an SPN is addressed back to that same SPN. Handling all cases identically is simpler, particularly in larger installations where a request is less likely to be local anyway.
Although the requests are made immediately and directly, the subchunks are each returned in a fully managed manner. Each TSRR is addressed to a specific user SPN using its SPNID and includes the DD# and LBA on the target user SPN for retrieving and returning the data. The TSRR may also include any other identifying information sufficient to ensure that the requested subchunk is properly returned to the appropriate requester and to enable the requester to identify the subchunk (e.g., a UP identifier to distinguish between multiple UPs executing on the destination SPN, a subchunk identifier to distinguish between the subchunks of each chunk, etc.). Each TSRR also includes a timestamp (TS) identifying the specific time when the original request was made. The TS identifies the priority of the request for purposes of synchronous transmission, where priority is based on time so that earlier requests assume higher priority. When received, the returned subchunks of the requested title are stored in a local title memory 213 for further processing and delivery to the user who requested the title.
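A minimal sketch of the TSRR fields described above (Python; the field names and the tie-breaking sequence number are illustrative assumptions, not part of the disclosure):

```python
import itertools
import time
from dataclasses import dataclass, field

_seq = itertools.count()

@dataclass(order=True)
class TimestampedReadRequest:
    """Sketch of a TSRR: earlier timestamps compare smaller, i.e. higher priority."""
    ts: float                                             # time of the original request
    seq: int                                              # tie-breaker for equal timestamps
    dest_spn: int = field(compare=False, default=0)       # SPNID holding the subchunk
    drive: int = field(compare=False, default=0)          # DD# on that node
    lba: int = field(compare=False, default=0)            # subchunk address on that drive
    requester_spn: int = field(compare=False, default=0)  # node that issued the request
    up_id: int = field(compare=False, default=0)          # which UP on the requesting node
    subchunk_id: int = field(compare=False, default=0)    # which subchunk of the chunk

def make_tsrr(dest_spn, drive, lba, requester_spn, up_id, subchunk_id):
    return TimestampedReadRequest(time.time(), next(_seq), dest_spn, drive,
                                  lba, requester_spn, up_id, subchunk_id)
```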
User SPN 203 illustrates the operation of a transfer process (TP) 215, which executes on each user SPN (e.g., 201, 203) along with supporting functions, for receiving TSRRs and returning the requested subchunks. The TP 215 includes, or otherwise interfaces with, a storage process (not shown) that interfaces with the local disk drives 111 on SPN 203 for requesting and accessing the stored subchunks. The storage process may be implemented in any desired manner, such as a state machine or the like, and may be a separate process interfaced between the TP 215 and the local disk drives 111, as known to those skilled in the art. As shown, the TP 215 receives one or more TSRRs from one or more UPs executing on the other user SPNs 103 and stores each request in a read request queue (RRQ) 217 in its local memory 109. The RRQ 217 stores a list of requests for subchunks SCA, SCB, etc. The disk drive storing the requested subchunks removes the corresponding requests from the RRQ 217, sorts them by physical position, and then performs each read in the sorted order. Accesses to the subchunks on each disk are managed in groups. Each group is sorted by physical position for an "elevator seek" operation (one sweep from low to high, the next sweep from high to low, and so on, so that the disk head sweeps back and forth across the disk surface, pausing to read the next sequential subchunk). Requests for successful reads are stored in a successful read queue (SRQ) 218 sorted by TS. Requests for failed reads (if any) are stored in a failed read queue (FRQ) 220, and the failure information is forwarded to a network management system (not shown), which determines the error and the appropriate corrective action. Note that in the configuration shown, the queues 217, 218, and 220 store request information rather than the actual subchunks.
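The elevator-seek ordering of queued reads can be modeled, in highly simplified form, as alternating ascending and descending sweeps over the pending subchunk addresses (illustrative only; a real disk scheduler works at a level of detail the description does not specify):

```python
def elevator_order(pending_lbas, ascending=True):
    """Return queued subchunk LBAs in elevator-seek order: one full sweep in the
    current direction; the caller alternates the direction on each sweep."""
    return sorted(pending_lbas, reverse=not ascending)

# Example: alternate sweep direction on successive batches of queued reads.
batches = [[900, 20, 510], [44, 730, 610]]
ascending = True
for batch in batches:
    for lba in elevator_order(batch, ascending):
        pass  # issue the read for this LBA here
    ascending = not ascending  # reverse direction for the next sweep
```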
Each successfully read subchunk is placed in memory in an LRU cache of recently requested subchunks. For each retrieved subchunk, the TP 215 creates a corresponding message (MSG) that includes the TS for the subchunk, the source (SRC) of the subchunk (e.g., the SPNID from which the subchunk is to be transmitted, its actual memory location, and any other identifying information), and the destination (DST) SPN to which the subchunk is to be transmitted (e.g., SPN 201). As shown, the SRQ 218 includes messages MSGA, MSGB, etc., for subchunks SCA, SCB, etc., respectively. After the requested subchunks are read and cached, the TP 215 sends the corresponding MSGs to a synchronized switch manager (SSM) 219 executing on the MGMT SPN 205.
The SSM 219 receives and prioritizes multiple MSGs from the TPs of the user SPNs and eventually sends a transmit request (TXR) to the TP 215 for one of the MSGs identified in its SRQ 218, such as by using a message identifier (MSGID) or the like. When the SSM 219 sends a TXR with the MSGID identifying a subchunk in the SRQ 218 to the TP 215, the request entry moves from the SRQ 218 to a network transfer process (NTP) 221 (where "moves" denotes removing the request from the SRQ 218), which builds the packets used to transfer the subchunk to the destination user SPN. The order in which subchunk request entries are removed from the SRQ 218 need not be sequential, even though the list is kept in timestamp order, because only the SSM 219 determines the proper ordering. The SSM 219 sends one TXR to every other SPN 103 that has at least one subchunk to send, unless the subchunk would go to a UP on an SPN 103 already scheduled to receive an equal- or higher-priority subchunk, as further described below. The SSM 219 then broadcasts a single transmit command (TX CMD) to all user SPNs 103. In response to the TX CMD broadcast by the SSM 219, the TP 215 instructs the NTP 221 to transfer the subchunk to the requesting UP of the user SPN 103. In this way, each SPN 103 that received a TXR from the SSM 219 transmits simultaneously to a different requesting user SPN 103.
The VFS 209 on the MGMT SPN 205 manages the list of titles and their locations in the ICE 100. In a typical computer system, the directory (data information) usually resides on the same disk as the data it describes. In the ICE 100, however, the VFS 209 is centrally located to manage the distributed data, since the data for each title is distributed across multiple disks of the disk array, which are in turn distributed across multiple user SPNs 103. As previously described, the disk drives 111 on the user SPNs 103 primarily store the subchunks of the titles. The VFS 209 includes identifiers for the location of each subchunk, by SPNID, DD#, and LBA as described above. The VFS 209 also includes identifiers for those parts of the content that are external to the ICE 100 (e.g., in optical storage). When a user requests a title, the full set of directory information (IDs/addresses) is made available to the UP executing on the user SPN 103 that received the user's request. From there, the task is to transfer the subchunks off the disk drives into memory (buffers), move them across the switch 101 to the requesting user SPN 103, which assembles the complete chunk in a buffer, deliver the chunk to the user, and repeat until done.
The SSM 219 creates a list of "ready" messages, in timestamp order, in a ready message (RDY MSG) list 223. The order in which the messages are received from the TPs on the user SPNs 103 is not necessarily in timestamp order, but the RDY MSG list 223 is kept in TS order. Just before the next set of transfers, the SSM 219 scans the RDY MSG list 223 starting with the earliest timestamp. The SSM 219 first identifies the earliest TS in the RDY MSG list 223 and generates and sends the corresponding TXR message to the TP 215 of the user SPN 103 storing the corresponding subchunk, to initiate the transmission of that subchunk in the current round. The SSM 219 continues scanning the list 223 in TS order for each subsequent subchunk, generating a TXR message for each subchunk whose source and destination are not already included in the current round of subchunk transmissions. For each TX CMD broadcast to all user SPNs 103, each user SPN 103 transmits only one subchunk at a time and receives only one subchunk at a time, although it can do both simultaneously. For example, if a TXR message is sent to the TP of SPN #10 scheduling a subchunk transmission to SPN #2, then SPN #10 cannot simultaneously send another subchunk. SPN #10 can, however, simultaneously receive a subchunk from another SPN. Likewise, SPN #2 cannot simultaneously receive another subchunk while receiving the subchunk from SPN #10, although SPN #2 can simultaneously send to another SPN, because of the full-duplex nature of each port of the switch 101.
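The per-round selection just described behaves like a greedy matching over the timestamp-ordered ready list, under the rule that each SPN sends at most one subchunk and receives at most one subchunk per round. A simplified model (message fields are illustrative):

```python
def schedule_round(ready_msgs):
    """Greedy per-round selection: walk the ready list in timestamp order and pick
    each message whose source is not yet sending and whose destination is not yet
    receiving in this round. Each message is a tuple (ts, src_spn, dst_spn, msg_id)."""
    sending, receiving, txrs = set(), set(), []
    for ts, src, dst, msg_id in sorted(ready_msgs):
        if src not in sending and dst not in receiving:
            sending.add(src)
            receiving.add(dst)
            txrs.append(msg_id)   # send a TXR for this message
    return txrs                   # then broadcast a single TX CMD

# Example: SPN 10 -> SPN 2 is picked first (earliest TS); a second message from
# SPN 10 must wait, but SPN 2 may still *send* to another node in the same round.
print(schedule_round([(1.0, 10, 2, "A"), (1.1, 10, 3, "B"), (1.2, 2, 7, "C")]))
```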
The SSM 219 continues scanning the RDY MSG list 223 until all user SPNs 103 have been accounted for or the end of the RDY MSG list 223 is reached. Each entry in the RDY MSG list 223 corresponding to a TXR message is eventually removed from the list (either when the TXR message is sent or after the transfer has completed). When the last transmission of the previous period has finished, the SSM 219 broadcasts a TX CMD packet, which signals all user SPNs 103 to begin the next round of transmissions. For the particular configuration described, the transmissions all take place simultaneously within a period of approximately 4 to 5 milliseconds. During each transmission round, additional MSGs are sent to the SSM 219 and new TXR messages are sent out to the user SPNs 103 to schedule the next round of transmissions, and the process repeats. The period between successive TX CMDs is approximately equal to the time needed to transmit all of the bytes of a subchunk, including packet overhead and inter-packet delay, plus a period to clear any caching that may have occurred in the switch during the subchunk's transmission, typically 60 microseconds (μs), plus a period to account for any jitter caused by delays in recognition of the TX CMD by an individual SPN, typically less than 100 μs.
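The quoted period between TX CMDs is consistent with the figures already given, i.e., the subchunk transmission time plus the two guard periods:

$$ T_{\text{round}} \approx 4.2\ \text{ms} + 60\ \mu\text{s} + 100\ \mu\text{s} \approx 4.4\ \text{ms}, $$

which, once packet overhead and inter-packet delay are included, is on the order of the 4 to 5 ms per round stated above.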
In one embodiment, a duplicate or mirror MGMT SPN (not shown) mirrors the primary MGMT SPN 205, so that the SSM 219, the VFS 209, and the scheduler are each duplicated on a pair of redundant, dedicated MGMT SPNs. In one embodiment, the synchronizing TX CMD broadcast acts as a heartbeat indicating the health of the MGMT SPN 205. The heartbeat is a signal to the secondary MGMT SPN that all is well. In the absence of the heartbeat, the secondary MGMT SPN takes over all management functions within a predetermined period, such as, for example, within 5 ms.
FIG. 3 is a partial block diagram of the ICE 100 further illustrating details and supporting functionality of the VFS 209 according to an embodiment of the present invention. As shown, the VFS 209 includes a virtual file manager (VFM) 301 and a VFS interface manager (VFSIM) 302. The VFSIM 302 is the communication conduit between the VFM 301 and the rest of the ICE 100, including a system monitor (SM) 303, a library loader (LL) 305, and a user master monitor (UMM) 307. The VFSIM 302 receives requests and directives from the SM 303 and provides services to the LL 305 and the UMM 307. Requests and directives for the VFM 301 are queued and held until retrieved; responses from the VFM 301 are buffered and returned to the requester. The VFSIM 302 manages background tasks initiated by itself and by the VFM 301. These tasks include automatic content re-striping, storage-device validation/repair, and capacity increases and decreases. The VFSIM 302 monitors hardware add/remove notifications and records device serial numbers so that validation/repair can be started automatically when needed. The discussion here refers to the VFS 209 as involving either or both of the VFM 301 and the VFSIM 302, unless otherwise noted.
The VFS 209 manages the storage of title content (distributed across the storage devices, or disk drives) in a manner that maximizes overall system performance and facilitates recovery from hardware failures. The VFS 209 is designed to support as broad a range of hardware configurations as possible, so that each site deploying the ICE 100 can fine-tune its hardware expenditure to meet specific operating characteristics. A site can increase its capacity by adding new SPNs 103 while the overall system remains operational. Likewise, the VFS 209 allows SPNs and individual storage devices (e.g., serial ATA (SATA) drives) to be placed in and out of service while the system keeps running. The number of SPNs 103 in the ICE 100 is limited only by the backplane bandwidth of the largest currently available switch 101 (at present roughly 500 SPNs). Each SPN 103 may have any number of storage devices (usually a constant number per SPN at a given site), and each storage device may have a different capacity (greater than or equal to a minimum specified for the site). At present, each SPN 103 at a site typically contains one to eight hard disk drives, though the design is flexible enough to accommodate new device types as they become available. Furthermore, if the capacity of a single physical SPN 103 is twice or three times the site minimum, it may be added to the VFS 209 as two or three logical SPNs (and similarly for any whole multiple of the site minimum). The VFS 209 is designed to allow each site to upgrade its hardware incrementally over time, as needed, using the best available hardware at the time of each addition.
The VFS 209 organizes content intelligently. It has provisions for handling peak loads gracefully by postponing non-critical tasks, it automatically redistributes content (the re-striping process) to take full advantage of increased site capacity, it anticipates failure recovery by rebuilding content before it is needed, and it has a robust ability to recover content from previously used storage devices. In the embodiment shown, the VFM 301 communicates exclusively with the VFSIM 302, which is managed by the SM 303 and provides services to the LL 305 and the UMM 307. At power-up, the VFS 209 has no knowledge of the system hardware configuration. As each user SPN 103 boots and announces itself, the SM 303 assembles the relevant details of that SPN (its group affiliation, number of disks, storage capacity of each disk, etc.) and registers it with the VFSIM 302, which notifies the VFM 301. Although each SPN can store content, not all need to do so. The VFS 209 allows any number of "hot spares" to be kept in reserve as blank disks, ready to be placed in service for failure recovery, scheduled maintenance, or other purposes.
At site initialization, a decision is made about the number of SPNs per RAID group. Content is spread evenly across the SPN groups, so SPNs must be added to a site in RAID-group increments. The only exceptions are SPNs designated as spares (which may be added independently in any quantity) and the redundant management SPNs. Most SPNs 103 are added at system initialization, but new SPN groups may be added at any point during the life of the system. When a site increases its capacity by adding new SPN groups, existing content is automatically re-striped in the background (the re-striping process is explained in more detail below) to take full advantage of the newly added hardware. Shrinking the ICE 100 is accomplished by first re-striping (a background process) and then deleting the units that have been vacated.
Within the VFS 209, each SPN 103 is assigned a logical ID that is entirely arbitrary, though for convenience it usually corresponds to the physical address of the SPN. Once added, a given SPN exists in the VFS 209 as a logical entity until it is deleted. Any non-busy SPN may be replaced by another SPN, which is assigned the same logical address when that happens. The ability to swap physical SPNs at will (explained in more detail below) allows scheduled maintenance to be performed without interrupting service. As soon as all SPNs of a group are registered with the VFS 209, that group can begin storing content. However, to allow content to be allocated uniformly across the whole system, all content-storing SPN groups should be registered before the first title is loaded.
As previously mentioned, each chunk of title content is stored in a different group, and the content is spread across all of the groups in a round-robin pattern. More specifically, each chunk is divided into subchunks (the number of subchunks equals the group size for the site, with one subchunk of the array being the parity subchunk derived from the data subchunks), and each subchunk is stored on a different SPN of the particular group. For example, assuming a RAID size of five disk drives, the SPN group size is five (each content chunk has five subchunks). If each SPN contains four drives, there are four RAID groups: the first group consists of drive 1 of each SPN, the second group of drive 2 of each SPN, and so on.
Consider an exemplary configuration of the ICE 100 for a first title, "Title 1", as described by Table 1 of FIG. 4, which includes three groups GP1-GP3, where each group is designated GP, each chunk is designated C, and each subchunk of a chunk is designated SC. Table 1 of FIG. 4 shows three groups labeled GP1-GP3, twelve chunks labeled C1-C12, and, for each chunk, five subchunks labeled SC1, SC2, SC3, SC4, and SCP, where the final "P" subchunk denotes the parity subchunk. The first chunk C1 of Title 1 is recorded as five subchunks SC1-SC4 and SCP (the fifth subchunk being the parity subchunk), located on drive 1 of SPNs 1-5, respectively, of the first group GP1. The next chunk C2 of Title 1 is recorded as five subchunks (again SC1-SC4 and SCP) located on drive 1 of SPNs 1-5 of the second group GP2. Likewise, the third chunk C3 is recorded on drive 1 of SPNs 1-5 of the third group GP3. The fourth chunk C4 is recorded on drive 2 of SPNs 1-5 of the first group GP1. Table 1 thus shows how the first title, "Title 1", is stored. The loss of an entire SPN (one row of Table 1) costs one drive in each of the four RAID groups; all of the RAID groups continue to deliver content, regenerating the missing content from parity. Additional titles begin in the group and drive immediately following the previous title. Thus a second title, Title 2 (not shown), begins on drive 2 of GP2 (its second chunk on drive 2 of GP3, its third chunk on drive 3 of group 1, and so on). Titles are allocated in this way to minimize start-time delay. Each title spirals around the ICE 100, wrapping from drive 4 of each SPN of group 3 back around to drive 1 of each SPN of group 1.
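The spiral allocation of Table 1 can be modeled as follows (a simplified sketch assuming the three groups and four drives per SPN of the example; the actual allocator also handles parity placement and free-space tracking not shown here):

```python
def chunk_slot(start_group, start_drive, chunk_index, num_groups=3, drives_per_spn=4):
    """Spiral placement of a title's chunks, as in Table 1: the group advances with
    every chunk, and the drive number advances each time the groups wrap around.
    Groups and drives are 1-based; the slot arithmetic is illustrative."""
    slot = (start_drive - 1) * num_groups + (start_group - 1) + chunk_index
    slot %= num_groups * drives_per_spn           # wrap around the whole array
    group = slot % num_groups + 1
    drive = slot // num_groups + 1
    return group, drive

# Title 1 starting at GP1/drive 1: chunks fall on GP1/d1, GP2/d1, GP3/d1, GP1/d2, ...
print([chunk_slot(1, 1, i) for i in range(4)])   # [(1, 1), (2, 1), (3, 1), (1, 2)]
# Title 2 starting at GP2/drive 2: GP2/d2, GP3/d2, GP1/d3, ...
print([chunk_slot(2, 2, i) for i in range(3)])   # [(2, 2), (3, 2), (1, 3)]
```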
Table 2 of FIG. 5 shows how four titles are stored using the configuration of Table 1. For purposes of illustration, the first title T1 contains 24 chunks T1C1-T1C24, the second title T2 contains 10 chunks T2C1-T2C10, the third title T3 contains 9 chunks T3C1-T3C9, and the fourth title T4 contains 12 chunks T4C1-T4C12. For simplicity, each of the three SPN groups (SPN group 1, SPN group 2, SPN group 3) is collapsed into a single row, and the first chunk of each title is underlined and set in bold. A typical 4 Mbps title contains 1,350 chunks, representing one and a half hours of content, organized into three VFS directory entries of 450 chunks each. Using 100 gigabyte (GB) disk drives, each RAID group holds more than 200,000 chunks (meaning that each drive in the group holds more than 200,000 subchunks). The subchunk allocations on the drives of a RAID group are typically located at the same point (logical block address) on each drive.
In the configuration illustrated, each directory entry (DE) of the VFS 209 contains various metadata about a title plus an array of chunk locators. The chunk locator data structure occupies 8 bytes: two bytes identify the group, two bytes identify the disk, and four bytes identify the allocation block on that disk, where each block holds one subchunk. FIG. 6 shows Table 3, which lists the contents of the first twelve locators of the four titles T1-T4 (shown as Title 1-Title 4) of Table 2. The additional twelve locators of Title 1 (not shown) use up the second allocation block of each disk. A lookup table, replicated in the VFSIM 302 and in each SPN 103, maps the logical address of each disk to the MAC (media access control) ID of the SPN to which it is attached. The LBA corresponding to a subchunk is obtained simply by multiplying the block number by the number of sectors per subchunk. FIG. 7 shows Table 4, which further illustrates the details of how the subchunks are stored on the different RAID groups, SPNs (numbered 1-5), and disk drives (numbered 1-4) of the ICE 100. For example, subchunk Sa of chunk C01 of title T1 is stored at block 0 of disk 1 of SPN 1 of RAID group 1, the next subchunk Sb of chunk C01 of title T1 at block 0 of disk 1 of SPN 2 of RAID group 1, and so on.
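A sketch of the 8-byte locator layout and the block-to-LBA conversion described above (the big-endian byte order, and the 1,024 sectors per subchunk implied by 512 KB subchunks and conventional 512-byte sectors, are assumptions not stated in the text):

```python
import struct

SECTORS_PER_SUBCHUNK = 1024   # assumed: 512 KB subchunk / 512-byte sectors

def pack_locator(group: int, disk: int, block: int) -> bytes:
    """Pack a chunk locator into the 8-byte layout described above:
    2 bytes group, 2 bytes disk, 4 bytes allocation block (one subchunk per block)."""
    return struct.pack(">HHI", group, disk, block)

def unpack_locator(raw: bytes):
    group, disk, block = struct.unpack(">HHI", raw)
    lba = block * SECTORS_PER_SUBCHUNK   # LBA = block number x sectors per subchunk
    return group, disk, block, lba

# Title T1, chunk C01, subchunk Sa: RAID group 1, disk 1, block 0  ->  LBA 0
print(unpack_locator(pack_locator(1, 1, 0)))
```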
Variations in content length cause small and unpredictable variations in the amount of content stored on each SPN 103. For these example titles the variation is exaggerated, but for hundreds of titles each containing a thousand or more chunks, the difference between SPNs is expected to be less than 1%. Although an individual storage device may have any capacity greater than the site minimum, capacity beyond the site minimum is not used for storing isochronous content. The site minimum is therefore kept as large as possible; typically it should be set equal to the capacity of the smallest-capacity storage device at the site. The site minimum may be raised or lowered at any time; for example, whenever the smallest-capacity devices are replaced by larger ones, it should be raised to a higher value.
Depending on where a given configuration of the ICE 100 is installed and how it is used, the VFS 209 may receive storage allocation requests for new titles only rarely, or it may receive up to a hundred nearly simultaneous requests at the start of each half hour. To satisfy anticipated storage demand quickly and efficiently, the VFS 209 maintains a pool of pre-allocated directory entries. The size of the pool is preset based on the usage profile of the site, and it may be changed at any time to adjust to, or respond to, changes in site characteristics. When the VFS 209 receives a storage allocation request, it first tries to satisfy the request from the pool of pre-allocated directory entries. If one is available, a pre-allocated directory entry is returned to the requester immediately. If the pool is exhausted, a new directory entry is created on demand, as described below. If an allocation request requires multiple directory entries for the same title, only the first is returned immediately; the remaining allocations for that title are made later, so the task is placed on the list of background tasks maintained by the VFS 209. Replenishing the pool of pre-allocated directory entries is also a background task.
To create a directory entry (whether pre-allocated or created on demand), the VFS 209 must first determine whether the required capacity is available (i.e., not currently in use). If it is, the request is easily completed. If it is not, the VFS 209 de-allocates one or more least-recently-used (LRU) titles in order to complete the request. When a title is de-allocated in this way, the VFS 209 informs the SM 303 and the SPNs 103 of the event. The allocation request is preliminarily fulfilled when the VFS 209 returns the first directory entry to the requester (or caller). When a title requires multiple entries, the subsequent entries are provided on demand as the caller specifies which ones it needs. Each entry contains a table that can store the subchunk locators for 30 minutes of content. A 95-minute movie therefore requires four entries, with the fourth entry only partially used in most cases. More precisely, most of the fourth entry's table goes unused, but no disk drive space is wasted, because the disk space actually consumed is only that needed for the 5 minutes of content. Internally, the VFS 209 uses efficient storage data structures to keep track of the available subchunk locations on each storage device.
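The interplay of the pre-allocated pool, on-demand creation, and LRU de-allocation can be sketched as a toy model (illustrative only; real capacities are tracked per chunk and per device rather than per whole entry):

```python
from collections import OrderedDict, deque

class DirectoryEntryPool:
    """Toy model of the pre-allocated directory-entry pool with LRU title eviction."""

    def __init__(self, capacity_entries, pool_target=4):
        self.free_entries = capacity_entries   # unallocated capacity, in whole entries
        self.pool = deque()                    # pre-allocated, not-yet-assigned entries
        self.titles = OrderedDict()            # title -> entry count, kept in LRU order
        self.pool_target = pool_target

    def _reserve(self):
        """Claim capacity for one entry, evicting LRU titles if necessary."""
        while self.free_entries < 1 and self.titles:
            lru_title, count = self.titles.popitem(last=False)  # de-allocate LRU title
            self.free_entries += count
        if self.free_entries < 1:
            raise RuntimeError("no capacity available")
        self.free_entries -= 1

    def replenish_pool(self):
        """Background task: top the pool back up to its target size."""
        while len(self.pool) < self.pool_target:
            self._reserve()
            self.pool.append(object())

    def allocate(self, title):
        """Return an entry immediately from the pool, or create one on demand."""
        if self.pool:
            entry = self.pool.popleft()
        else:
            self._reserve()
            entry = object()                   # created on demand
        self.titles[title] = self.titles.get(title, 0) + 1
        self.titles.move_to_end(title)         # mark the title as recently used
        return entry
```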
Reclaiming unused storage space is made possible by including a last-valid-chunk (LVC) pointer in each entry. In the example above, the fourth entry contains 30 minutes' worth of potential storage when first given to the requester. When the component actually storing the content has finished its task, it updates the LVC pointer and informs the VFS 209, which then releases any unused region for use elsewhere. Because titles vary in length, each title can end anywhere; no disk space needs to be wasted, for example to align storage to some boundary. The VFS 209 therefore packs each disk as tightly as possible, using the next free block of the device. Initially, for simplicity, small files (e.g., system files that fit entirely within a single entry) and other content are managed in the same way. Eventually, a small-file VFS capability may be added, which treats a chunk as though it were a disk drive in order to store many small files within it.
The SM 303 can also direct the VFS 209 to de-allocate a title at any time, for example when the title's license period expires, or for other reasons. A de-allocation command is complicated by the fact that the title may currently be in use; when this happens, in one embodiment, the de-allocation is not finalized until every user accessing the title has signaled that it is completely finished with it. The VFS 209 keeps track of all entries currently in use by each UMM 307, as well as the entries in use by background processes. In the meantime, no new users are allowed to access a title that has been marked for de-allocation.
After SPN groups are added or deleted, existing content is redistributed, or "re-striped", so that resource utilization is as uniform as possible once the re-striping process completes. The VFS 209 performs re-striping automatically whenever it is needed. To keep things simple, the new and old entries do not overlap at all; there is no storage area shared between new and old (see below). Once the re-striped copy is complete (the completion time cannot be predicted, because the pace is limited by available bandwidth), new users can access it, and the old copy is simply de-allocated using the standard procedure. During re-striping, most subchunks are copied from their original SPN to a different SPN, but a small fraction is copied to a different location on the same SPN. The proportion of subchunks that remain on the same SPN is m/(m*n), where "m" is the previous number of SPNs and "n" is the new number. For a site upgrading from 100 to 110 SPNs, 100 out of every 11,000 subchunks are copied within the same SPN.
Live content is simply the extreme case of transient content, combined with an instance in which that content is being saved. Whenever a transient real-time buffer is needed, in one embodiment ICE100 uses a single 30-minute directory entry as a circular buffer, and when the buffer is no longer needed it is de-allocated using the standard procedure, just like any other title. When live content is being saved, an additional 30-minute entry is requested ahead of need, with VFS209 de-allocating LRU titles if necessary. As with any other title, the raw content can immediately be played back up to the point represented by the LVC pointer, which is updated periodically as storage continues. In some cases the "raw content" is edited into specific titles before being made available to users who wish to request it starting from some point after the original start time within the raw content. Once ready, the edited content is added to VFS209 like any other title, and the raw content can then be deleted.
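For illustration only, the sketch below models one directory entry used as a circular buffer for a live stream, with the LVC pointer advancing as subchunks are written and playback allowed up to the LVC. The slot count and names are hypothetical.

class CircularEntryBuffer:
    """One directory entry (a fixed number of subchunk slots) used as a ring buffer."""

    def __init__(self, slots):
        self.slots = [None] * slots   # subchunk payloads / locators
        self.write_pos = 0            # next slot to (over)write
        self.lvc = -1                 # index of the last valid chunk written so far

    def write_subchunk(self, subchunk):
        self.slots[self.write_pos] = subchunk
        self.lvc = self.write_pos                     # LVC updated as storage continues
        self.write_pos = (self.write_pos + 1) % len(self.slots)

    def readable_up_to(self):
        """Playback is allowed up to the point represented by the LVC pointer."""
        return self.lvc

buf = CircularEntryBuffer(slots=6)
for i in range(8):                                    # wraps around after 6 writes
    buf.write_subchunk(f"subchunk-{i}")
print(buf.readable_up_to(), buf.slots)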
Occasionally it is desirable to take an SPN103 or a disk drive offline, for maintenance or any other purpose. So that this can be done without adverse impact, ICE100 is provisioned such that one of the hot spares is used as the copy target, or more precisely as a "clone" of the device being taken offline. When the cloning process completes (again, the time cannot be predicted because it is limited by the available bandwidth), the clone assumes the identity of the previous device and operation continues smoothly, with VFSIM302 and the SPNs103 notified. Unless the device is physically disconnected from and reconnected to ICE100 (that is, unless it is actually removed and replaced), VFM301 need not be involved at all, because the cloning process and the identity exchange are invisible to VFM301 (to VFM301 an SPN is a logical entity, not a physical one, because Internet Protocol (IP) addresses are used rather than MAC IDs). When a disk or SPN is connected to ICE100, it automatically runs the verification/repair process (described below) to ensure that its data is complete.
From the perspective of any given content stream, the loss of one storage device and the loss of an entire SPN look the same: one subchunk out of every n is missing from each data chunk (where n is determined by the number of SPNs103 in the system). Parity reconstruction allows ICE100 to compensate for this class of loss, leaving ample time for hardware replacement. Repair, verification and cloning are disk-specific processes; to repair, verify or clone an SPN, the corresponding procedure is simply started for each disk within that SPN. When a UP issues the subchunk requests for a data chunk and any one subchunk does not arrive within the allotted time, the UP uses the subchunks that were retrieved to reconstruct the missing one. In one embodiment, the reconstructed subchunk is sent to the user SPN from which it should have been obtained, regardless of the cause of the failure (whether the SPN or its disk drive has failed, or the network is merely slow). If the user SPN from which the missing subchunk should have been obtained cannot receive the reconstructed subchunk (for example, the SPN is temporarily offline), then the subchunk is simply lost in transit. If that SPN can receive the reconstructed subchunk (for example, the SPN has come back online, or the fault is confined to one of its disk drives), it buffers the subchunk in memory as though it had been read from the local disk drive.
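The loss-concealment step lends itself to a short sketch. This passage does not state the parity scheme, so the example below assumes a single XOR parity subchunk per chunk (consistent with the five-subchunk RAID groups mentioned later): any one missing subchunk is recovered by XOR-ing the four that arrived. It is an illustration, not the claimed method.

from functools import reduce

def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def make_parity(data_subchunks):
    return xor_bytes(data_subchunks)

def reconstruct_missing(received):
    """Rebuild the subchunk that timed out from the 4 subchunks that did arrive."""
    return xor_bytes(received)

data = [bytes([i] * 8) for i in (1, 2, 3, 4)]     # four data subchunks of one chunk
parity = make_parity(data)                        # fifth subchunk of the group

# Subchunk 2 never arrives within the allotted time; the UP rebuilds it.
received = [data[0], data[1], data[3], parity]
rebuilt = reconstruct_missing(received)
assert rebuilt == data[2]
print("rebuilt subchunk:", rebuilt.hex())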
Hot swapping and parity reconstruction require each SPN103 to know whether each block of each of its devices is valid. Initially, when an SPN comes online, it has no valid blocks. When the SPN receives and stores a subchunk (or verifies one that already exists), it marks that block as valid. When the SPN receives a request for a subchunk stored in a block marked invalid, it replies with a request to receive that subchunk. If the missing subchunk is produced by parity reconstruction elsewhere in ICE100, it is sent back to the SPN (using spare bandwidth) to be stored, and the block is marked valid. The absence of such a request for the subchunk indicates that the SPN is still down, and no reconstructed subchunk needs to be sent. Using this protocol, a replacement device is brought back into service at trivial extra cost. Meanwhile, to catch the data chunks that are not repaired as a side effect of being frequently requested, a simple background verification/repair process performs a beginning-to-end rebuild, skipping the regions already marked valid.
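A sketch of this per-block validity protocol, under assumed message names: an SPN coming online marks everything invalid, a read against an invalid block is answered with a request to be sent that subchunk, and storing (or verifying) a subchunk marks its block valid.

class SpnValidityMap:
    """Toy model of an SPN tracking which of its blocks hold valid subchunks."""

    def __init__(self, num_blocks):
        self.valid = [False] * num_blocks   # an SPN coming online has no valid blocks
        self.store = [None] * num_blocks

    def store_subchunk(self, block, subchunk):
        self.store[block] = subchunk
        self.valid[block] = True            # received (or verified) data marks the block valid

    def handle_read(self, block):
        if self.valid[block]:
            return ("SUBCHUNK", self.store[block])
        # Block marked invalid: reply with a request to receive this subchunk,
        # so a parity-reconstructed copy can be stored here using spare bandwidth.
        return ("PLEASE_SEND", block)

spn = SpnValidityMap(num_blocks=4)
print(spn.handle_read(2))                   # -> ('PLEASE_SEND', 2)
spn.store_subchunk(2, b"reconstructed")
print(spn.handle_read(2))                   # -> ('SUBCHUNK', b'reconstructed')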
Under particular circumstances, such as when VFSIM302 recognizes a reconnected disk known to contain valid content, it instructs the SPN103 to disregard the prohibition against sending subchunks marked invalid. If such a trial subchunk passes the checksum test, it can be used (and the source SPN marks it valid), so the unnecessary overhead of parity reconstruction is avoided. An SPN that fails to supply a requested subchunk and also fails to request that subchunk indicates an SPN failure. By monitoring these failures, ICE100 automatically notifies the system operator and can initiate recovery procedures during lights-out operation.
When a different physical disk replaces an existing disk whose content is known, VFSIM302 automatically initiates and manages the disk repair. For disk repair/verification, VFM301 prepares disk repair entries (DREs) similar to the directory entries already in use, but with a few small differences. Each DRE covers 450 subchunks, all from the failed drive, drawn from data chunks that may span more than one title. The checksum of each subchunk (including the missing ones) is also included. The DREs are provided starting with the most recently used title, followed by the next most recently used title, and so on. It does not matter if a title does not fit exactly, because the next DRE picks up where the last one left off. Because the total number of DREs is not known in advance, each DRE simply contains a flag indicating whether it is the last. This process allows the repair to proceed in an orderly, prioritized fashion that preserves the greatest possible data integrity.
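The DRE itself can be summarized as a small record. The dataclass form and field names below are illustrative assumptions; the figure of 450 subchunks per DRE and the last-entry flag come from the text.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SubchunkRef:
    spn_id: str          # node that holds (or held) the subchunk
    drive_id: str
    lba: int             # logical address of the subchunk
    checksum: int        # included for every subchunk, missing ones included

@dataclass
class DiskRepairEntry:
    """One DRE: up to 450 subchunks from the failed drive, MRU titles first."""
    subchunks: List[SubchunkRef] = field(default_factory=list)
    is_last: bool = False            # tells the repairer whether more DREs follow

    MAX_SUBCHUNKS = 450

    def add(self, ref: SubchunkRef) -> bool:
        if len(self.subchunks) >= self.MAX_SUBCHUNKS:
            return False             # caller starts the next DRE where this one left off
        self.subchunks.append(ref)
        return True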
Repair is desirable whenever data loss occurs, for example when a new disk replaces a failed disk. Wherever in ICE100 the failed disk is replaced, the recovery is driven by the SPN103 that receives the new, empty disk. Working through one DRE at a time, the host SPN requests the surviving group-member subchunks of each missing subchunk and uses them for parity reconstruction. The reconstructed subchunk is stored and its block is marked valid. Alternatively, if the failed disk is attached to an idle SPN, VFSIM302 recognizes it and attempts to salvage any readable subchunks, in an effort to reduce the amount of parity reconstruction required. VFSIM302 first sends the DREs to the idle SPN, where the checksums and locators are used to test the validity of each candidate subchunk. When a subchunk passes, the idle SPN marks it valid and sends it to the SPN that needs it, where it is stored as valid. When the idle SPN has salvaged and sent every subchunk it can, it notifies VFSIM302 that it has finished the DRE. If not all subchunks have been recovered at that point, VFSIM302 sends the DRE to the SPN receiving the new disk, and parity reconstruction is performed where needed.
Content verification is desirable whenever any disk or SPN is connected to the system, for example when a rebuilt disk is moved from one SPN to another. The verification process is essentially the same as the repair process, only faster. The same DREs are used, and each candidate subchunk is tested one at a time. A checksum is computed for each subchunk present on the disk. If the computed checksum matches the checksum in the DRE, the subchunk is considered valid. If the checksums do not match, the other four subchunks of the corresponding RAID group are requested from the other SPNs, and the missing subchunk is rebuilt and stored. Verification is faster than reconstruction simply because most subchunks (if not all) pass the initial checksum test. Because the verification procedure is identical to the reconstruction procedure, an operator can conveniently move a drive to its correct slot even if reconstruction has only partly completed. When the operator pulls a partially rebuilt disk, the reconstruction procedure is abandoned; when the disk is inserted into its new slot, a new verification/reconstruction procedure is started.
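Per candidate subchunk, the verify-or-rebuild decision reduces to a checksum comparison, as in this sketch. The checksum function (CRC32 here) and the rebuild callback are placeholders; only the control flow (accept on match, parity-rebuild on mismatch or absence) follows the description.

import zlib

def verify_or_rebuild(candidate, expected_checksum, rebuild_from_peers):
    """Return a valid subchunk: either the candidate read from the disk, or a rebuilt copy.

    candidate:          bytes read from the disk, or None if the subchunk is absent
    expected_checksum:  checksum carried in the DRE for this subchunk
    rebuild_from_peers: callable performing parity reconstruction from the other
                        subchunks of the RAID group (placeholder for the real request path)
    """
    if candidate is not None and zlib.crc32(candidate) == expected_checksum:
        return candidate, "verified"          # most subchunks pass, so verification is fast
    return rebuild_from_peers(), "rebuilt"    # mismatch or missing: request peers and rebuild

good = b"subchunk payload"
print(verify_or_rebuild(good, zlib.crc32(good), lambda: b"<rebuilt>"))           # verified
print(verify_or_rebuild(b"corrupted", zlib.crc32(good), lambda: b"<rebuilt>"))   # rebuilt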
Cloning is simpler than reconstruction/verification because data need only be copied from the host device. The clone host sends its stored content to the recipient, and thereafter sends any changes as they occur. This means that after the content has been transferred in its entirety to the recipient, the cloning procedure can remain idle indefinitely while keeping the two devices synchronized. When the clone is complete, the clone device assumes the logical identity of the host device, and no further verification is required (unless the device is moved). Apart from its latent role in verification, VFS209 is not involved in cloning. Because the host is responsible for the transfer and for synchronization, no duplicate data structures need to be created (and later torn down) for the recipient within VFS209.
In response to requests from SM303, VFS209 can report information useful for managing ICE100, including a most recently used (MRU) title list (not shown) and a device operation report (not shown) containing statistics. The MRU list includes one record for each title currently stored, together with title-specific information such as the date it was last requested, the total number of times it has been requested, its total size, and whether it may be deleted. The device operation report includes one record per SPN, giving its IP address, its group affiliation and device IDs, and an array of per-storage-device information including the total number of blocks and the number of blocks currently allocated. VFS209 also contributes an entry to the system log for each critical event.
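These two reports are essentially flat record lists; a minimal sketch of their shape follows, with field names assumed from the description rather than specified by it.

from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class MruTitleRecord:                 # one per currently stored title
    title: str
    last_requested: date
    request_count: int
    total_size_subchunks: int
    deletable: bool

@dataclass
class DriveUsage:                     # one per storage device on an SPN
    total_blocks: int
    allocated_blocks: int

@dataclass
class SpnOperationRecord:             # one per SPN in the device operation report
    ip_address: str
    group_id: str
    device_ids: List[str]
    drives: List[DriveUsage]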
It can now be appreciated that a virtual file system according to the present invention provides an organized distribution of title data that maximizes both the access speed and the storage efficiency of the titles. Each title is divided into multiple subchunks that are distributed across the disk drives of a disk drive array coupled to multiple storage processor nodes, which include a management node. A virtual file manager executing on the management node manages the storage and access of each subchunk of each title stored on the array. The virtual file manager maintains a directory entry for each title, and each directory entry is a list of the subchunk location entries for that title. Each subchunk location entry includes a storage processor node identifier, a disk drive identifier, and a logical address for locating and accessing each subchunk of each title stored on the disk drive array.
Centralized file management provides many advantages over prior-art disk and storage systems. A file, or "title", can be of any size up to the combined storage capacity of all the drives, unconstrained by any single drive or redundant storage group. Because the directory information is stored centrally, the full capacity of every drive is available for content. Requests for a title are not limited to one disk drive or a few disk drives; instead, the load is distributed across up to all of the disk drives in the array. The synchronous switch manager maximizes efficiency by ensuring that each node receives exactly one data subchunk per period during sequential transfer. Centralized file management allows the full output bandwidth of each disk drive to be realized without requiring any local directory hierarchy on any disk drive. In one embodiment, the factory-provided logical-to-physical remapping of each disk drive is used, allowing information to be retrieved from each drive with a single seek operation. As those skilled in the art will appreciate, the penalty of standard directory seeks is severe, reducing drive bandwidth to well below half of its rated value. In contrast, each subchunk location entry is sufficient to locate and access the corresponding subchunk of a title, minimizing the cost of retrieving and forwarding a data subchunk at each storage processor node. There is no need to interface with a complicated operating system or to perform intermediate directory searches or the like. The transfer process on the identified storage processor node accesses a subchunk by providing the logical address (that is, a logical block address) to the identified disk drive, and the identified disk drive directly returns the subchunk stored at that logical address.
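The read path just described (and recited in claims 4 and 5) is short enough to sketch. The request fields, class names, and subchunk size below are illustrative assumptions; the point of the example is that the transfer process needs only a node identifier, a drive identifier, and a logical address to serve a subchunk with a single seek and no directory lookup.

import io

SUBCHUNK_SIZE = 512 * 1024            # illustrative subchunk size in bytes

class TransferProcess:
    """Runs on a storage processor node; serves subchunk read requests."""

    def __init__(self, local_drives):
        self.local_drives = local_drives              # drive_id -> file-like raw device

    def handle_read(self, request, nodes):
        drive = self.local_drives[request["drive_id"]]
        drive.seek(request["lba"] * SUBCHUNK_SIZE)    # single seek straight to the subchunk
        subchunk = drive.read(SUBCHUNK_SIZE)          # no directory lookup on the drive
        nodes[request["dest_node"]].deliver(subchunk) # forward to the requesting node

class UserNode:
    def __init__(self):
        self.received = []
    def deliver(self, subchunk):
        self.received.append(subchunk)

drive0 = io.BytesIO(bytes(4 * SUBCHUNK_SIZE))         # stand-in for a raw disk drive
tp = TransferProcess({"drive0": drive0})
user = UserNode()
tp.handle_read({"dest_node": "user", "drive_id": "drive0", "lba": 2}, {"user": user})
print(len(user.received), "subchunk(s) of", len(user.received[0]), "bytes received")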
The virtual file system further uses data and/or process redundancy to protect against data loss and to avoid interrupting service during rebuilds. The redundant storage groups span the storage processor nodes, tolerating the failure of any single drive, the failure of any one drive in each redundant disk group (for example, each RAID array), or the removal of any individual node together with all of its drives. Each drive is uniquely identified, allowing the system to configure itself automatically at startup and to recover quickly from a local failure or an anticipated disk failure. When a drive error occurs, parity reconstruction is performed and the reconstructed data is sent to the node from which that data should have originated, so that it can be cached there. This structure and process avoid redundant rebuilding of popular titles until the drive and/or node is replaced, saving substantial time for the user processes distributed across the nodes. Further, a redundant management node executing a redundant virtual file manager allows the system as a whole to continue operating without interruption through any single point of failure.
Many other advantages are also obtained. The interactive content engine 100 can handle hundreds of simultaneous storage allocation requests without becoming overloaded. Its directory handling uses less than 1% of the bandwidth of 100,000 streams, and it allows a large number of video streams to be recorded and played back simultaneously without overloading the system. It allows management functions (such as pre-allocating storage, re-partitioning content, deleting titles, and cloning drives and SPNs) to take place in the background without disturbing synchronized content playback and ingestion.
Although the present invention has been described in considerable detail with reference to certain preferred versions, other versions and variations are possible and contemplated. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention, without departing from the spirit and scope of the invention as defined by the following claims.

Claims (25)

1. A virtual file system, comprising:
a plurality of storage processor nodes including at least one management node, each said storage processor node comprising a port interface and a disk drive interface;
a backbone switch comprising a plurality of ports, each said port coupled to a corresponding port interface of said plurality of storage processor nodes, said backbone switch enabling communication between the nodes of said plurality of storage processor nodes;
a disk drive array coupled to and distributed across the disk drive interfaces of said plurality of storage processor nodes, said disk drive array storing a plurality of titles, each title being divided into a plurality of subchunks distributed across said disk drive array, wherein each subchunk is stored on a disk drive of said disk drive array;
a virtual file manager executing on said at least one management node, which manages the storage and access of each subchunk of said plurality of titles and maintains a plurality of directory entries including a directory entry for each title, each said directory entry comprising a list of subchunk location entries, wherein each subchunk location entry includes a storage processor node identifier, a disk drive identifier, and a logical address for locating and accessing each subchunk of each title stored on said disk drive array; and
a user process executing on a storage processor node, which submits a title request for a selected title to said virtual file manager, receives the corresponding directory entry for said selected title from said virtual file manager, submits a subchunk read request for each subchunk location entry in said corresponding directory entry, each subchunk read request being sent to the storage processor node identified by the storage processor node identifier in the corresponding subchunk location entry of said corresponding directory entry, receives subchunks from the storage processor nodes so identified, and uses the received subchunks to reassemble said selected title.
2. The virtual file system of claim 1, wherein providing said logical address, by the identified storage processor node, to the identified disk drive enables each of said plurality of subchunks to be retrieved with a single seek operation.
3. The virtual file system of claim 1, wherein the full capacity of each disk drive of said disk drive array is available for storing said plurality of subchunks of said plurality of titles.
4. The virtual file system of claim 1, wherein:
each subchunk read request includes a destination node identifier, said disk drive identifier and said logical address; and
said virtual file manager retrieves said corresponding directory entry for said selected title and forwards said corresponding directory entry to said user process in response to said title request.
5. The virtual file system of claim 4, further comprising a transfer process executing on a storage processor node, which receives a subchunk read request, retrieves the requested subchunk from the local disk drive identified by said disk drive identifier by using said logical address to locate the requested subchunk, and forwards the retrieved subchunk to the storage processor node identified by said destination node identifier.
6. The virtual file system of claim 1, wherein each title is divided into a plurality of data chunks, each said data chunk comprising a plurality of subchunks which collectively include redundant data for that data chunk, and wherein said user process is operable to reconstruct any data chunk from fewer than all of the subchunks comprising that data chunk.
7. The virtual file system of claim 6, wherein said disk drive array is divided into a plurality of redundant array groups, each redundant array group comprising a plurality of disk drives distributed across a plurality of storage processor nodes, and wherein said plurality of subchunks of each data chunk are distributed across the plurality of disk drives of a corresponding redundant array group.
8. The virtual file system of claim 7, wherein said user process is operable to reconstruct any stored title upon any of the following: failure of any single disk drive; failure of any one disk drive in each of said plurality of redundant array groups; or failure of any one of said plurality of storage processor nodes.
9. The virtual file system of claim 8, wherein said user process is operable to reconstruct a missing subchunk of a data chunk from the remaining subchunks of that data chunk, and to return the reconstructed missing subchunk to the storage processor node from which the missing subchunk should have been obtained.
10. The virtual file system of claim 9, wherein, when the storage processor node from which the missing subchunk should have been obtained has failed and is replaced by a substitute storage processor node, the substitute storage processor node rebuilds the missing title data by receiving and storing subchunks, the received subchunks including returned subchunks and reconstructed subchunks.
11. The virtual file system of claim 9, further comprising a buffer memory, coupled to the storage processor node from which the missing subchunk should have been obtained, which temporarily stores received subchunks, including returned subchunks and reconstructed subchunks, for transfer to a replacement disk drive that replaces a failed disk drive.
12. The virtual file system of claim 1, wherein each subchunk is stored in a block of the disk drive identified by said logical address, and wherein said logical address comprises a logical block address.
13. The virtual file system of claim 1, wherein said virtual file manager manages storage of the titles, wherein each title is divided into a plurality of data chunks, each data chunk comprising a plurality of subchunks which include redundant data for that data chunk.
14. The virtual file system of claim 1, wherein said at least one management node comprises a mirrored management node executing a mirrored virtual file manager which mirrors the operation of said virtual file manager.
15. The virtual file system of claim 1, wherein said virtual file manager maintains a pool of pre-allocated directory entries, each comprising a list of available subchunk location entries.
16. The virtual file system of claim 15, wherein the number of directory entries in said pool of pre-allocated directory entries is based on performance and on the site usage profile.
17. A virtual file system, comprising:
a plurality of storage processor nodes including at least one management node, each said storage processor node comprising a port interface and a disk drive interface;
a backbone switch comprising a plurality of ports, each said port coupled to a corresponding port interface of said plurality of storage processor nodes, said backbone switch enabling communication between the nodes of said plurality of storage processor nodes;
a disk drive array coupled to and distributed across the disk drive interfaces of said plurality of storage processor nodes, said disk drive array storing a plurality of titles, each title being divided into a plurality of subchunks distributed across said disk drive array, wherein each subchunk is stored on a disk drive of said disk drive array; and
a virtual file manager executing on said at least one management node, which manages the storage and access of each subchunk of said plurality of titles and maintains a plurality of directory entries including a directory entry for each title, each said directory entry comprising a list of subchunk location entries, wherein each subchunk location entry includes a storage processor node identifier, a disk drive identifier, and a logical address for locating and accessing each subchunk of each title stored on said disk drive array;
wherein said virtual file manager manages storage of the titles, wherein each title is divided into a plurality of data chunks, each data chunk comprising a plurality of subchunks which include redundant data for that data chunk, and wherein said disk drive array is divided into a plurality of redundant array groups, each redundant array group comprising a plurality of disk drives distributed across a plurality of storage processor nodes, and wherein said plurality of subchunks of each data chunk are distributed across the plurality of disk drives of a corresponding redundant array group.
18. The virtual file system of claim 17, further comprising:
a replacement disk drive, coupled to a first storage processor node, having a plurality of missing subchunks;
said virtual file manager preparing a disk repair directory entry which lists, for each missing subchunk, the corresponding parity subchunks that make up its data chunk, and forwarding said disk repair directory entry to said first storage processor node; and
a repair process, executing on said first storage processor node, which submits a subchunk read request for each parity subchunk listed in said disk repair directory entry for each missing subchunk, uses the received corresponding parity subchunks to reconstruct each missing subchunk, and stores the reconstructed subchunks on said replacement disk drive.
19. The virtual file system of claim 18, further comprising:
an idle storage processor node;
a locally failed disk drive, replaced by said replacement disk drive, coupled to said idle storage processor node;
said virtual file manager sending said disk repair directory entry to said idle storage processor node before forwarding it to said first storage processor node; and
a file salvage process, executing on said idle storage processor node, which uses checksums and locators to test the validity of the missing subchunks stored on said locally failed disk drive, and forwards valid subchunks read from said locally failed disk drive to said first storage processor node for storage on said replacement disk drive.
20. The virtual file system of claim 19, wherein said repair process discards a received valid subchunk read from the locally failed disk drive when the corresponding missing subchunk has already been reconstructed and stored on said replacement disk drive.
21. The virtual file system of claim 17, wherein said disk drive array comprises data chunks distributed across said plurality of redundant array groups.
22. The virtual file system of claim 21, wherein said virtual file manager, in response to a change in the number of said disk drives, performs a re-partitioning process to redistribute said plurality of data chunks so as to maintain an even distribution of data.
23. The virtual file system of claim 22, wherein said re-partitioning process runs as a background task.
24. The virtual file system of claim 22, wherein, upon detecting an increase in the number of disk drives of said disk drive array, said virtual file manager performs said re-partitioning process to redistribute said plurality of data chunks onto the new disk drives of said disk drive array so as to maintain an even distribution of data.
25. The virtual file system of claim 22, wherein said virtual file manager, upon detecting a request to remove a specified disk drive from said disk drive array, performs said re-partitioning process to redistribute said plurality of data chunks across the remaining disk drives so as to maintain an even distribution of data, and de-allocates said specified disk drive.
CN2004800398047A 2003-12-02 2004-12-02 Virtual file system Expired - Fee Related CN1902620B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US52639003P 2003-12-02 2003-12-02
US60/526,390 2003-12-02
US10/999,286 2004-11-30
US10/999,286 US7644136B2 (en) 2001-11-28 2004-11-30 Virtual file system
PCT/US2004/040367 WO2005057343A2 (en) 2003-12-02 2004-12-02 Virtual file system

Publications (2)

Publication Number Publication Date
CN1902620A (en) 2007-01-24
CN1902620B true CN1902620B (en) 2011-04-13

Family

ID=37657590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2004800398047A Expired - Fee Related CN1902620B (en) 2003-12-02 2004-12-02 Virtual file system

Country Status (2)

Country Link
CN (1) CN1902620B (en)
IL (1) IL176053A0 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101399840B (en) * 2007-09-26 2013-10-23 新奥特硅谷视频技术有限责任公司 Method and system for implementing image storage by virtual file system technique
CN111125041A (en) * 2018-10-31 2020-05-08 伊姆西Ip控股有限责任公司 Data reading method, electronic device and computer program product
CN111324293B (en) * 2018-12-14 2022-08-05 杭州海康威视系统技术有限公司 Storage system, data storage method, data reading method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6134596A (en) * 1997-09-18 2000-10-17 Microsoft Corporation Continuous media file server system and method for scheduling network resources to play multiple files having different data transmission rates
US6374336B1 (en) * 1997-12-24 2002-04-16 Avid Technology, Inc. Computer system and process for transferring multiple high bandwidth streams of data between multiple storage units and multiple applications in a scalable and reliable manner


Also Published As

Publication number Publication date
IL176053A0 (en) 2006-10-05
CN1902620A (en) 2007-01-24

Similar Documents

Publication Publication Date Title
US7644136B2 (en) Virtual file system
EP1692620B1 (en) Synchronized data transfer system
JP3117390B2 (en) Method of distributing a data set among a plurality of disks and related apparatus and method
US6233607B1 (en) Modular storage server architecture with dynamic data management
US7254702B2 (en) Method of distributed recording whereby the need to transition to a second recording device from a first recording device is broadcast by the first recording device
US7873702B2 (en) Distributed redundant adaptive cluster
EP1393560A1 (en) System and method for retrieving and storing multimedia data
Mourad Issues in the design of a storage server for video-on-demand
Ghandeharizadeh et al. Design and implementation of scalable continuous media servers
US20030154246A1 (en) Server for storing files
CN105404561A (en) Erasure code implementation method and apparatus for distributed storage system
JP5391705B2 (en) Storage system
CN1902620B (en) Virtual file system
CN100410917C (en) Synchronized data transfer system
TWI692955B (en) Server and associated computer program product
JP2006146661A (en) Replication system, method, replica storage, and program
CN1228568A (en) Computer network service device
TW202029696A (en) Server and associated computer program product
JP2001275098A (en) Server system and data transfer method
JP2004157615A (en) System and method for assignment of network interface of av server, and system and method for file transfer request to av server

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1100191

Country of ref document: HK

C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: WD

Ref document number: 1100191

Country of ref document: HK

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110413

Termination date: 20181202