WO1996014617A1 - Multicomputer system and method - Google Patents

Multicomputer system and method

Info

Publication number
WO1996014617A1
Authority
WO
WIPO (PCT)
Prior art keywords
clm
computer
set forth
multicomputer system
program
Application number
PCT/US1994/012921
Other languages
French (fr)
Inventor
Yuan Shi
Original Assignee
Temple University - Of The Commonwealth System Higher Education
Application filed by Temple University - Of The Commonwealth System Higher Education filed Critical Temple University - Of The Commonwealth System Higher Education
Priority to PCT/US1994/012921 priority Critical patent/WO1996014617A1/en
Priority to JP8515266A priority patent/JPH10508714A/en
Priority to AU11746/95A priority patent/AU1174695A/en
Priority to EP95902495A priority patent/EP0791194A4/en
Publication of WO1996014617A1 publication Critical patent/WO1996014617A1/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/40 - Transformation of program code
    • G06F8/41 - Compilation
    • G06F8/45 - Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/456 - Parallelism detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 - Arrangements for executing specific programs
    • G06F9/448 - Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4494 - Execution paradigms, e.g. implementations of programming paradigms data driven

Definitions

  • the -D option designates a data density threshold for determining the worthiness of parallelization of repetitive program segments.
  • the -G option designates a grain size value for optimizing the difference.
  • the -F option activates fault tolerance code generation.
  • the -R option may impact the overall performance due to the extra overhead introduced in fault detection and recovery.
  • the time-out value of -F may range from microseconds to 60 seconds.
  • the -V option controls the depth of vectorization. A greater value of V may be used to obtain better performance when a large number of CLM processors are employed.
  • the -F option defines the factoring percentage used in a heuristic tuple size
  • CLM compiler commands for other programming languages may have similar syntax structure.
  • FIGS. 10A and 10B illustrate the CLM compiler operating principle. Every sequentially composed program, regardless of the number of sub-programs, can be transformed into one linear sequence of single instructions.
  • the repetitive segments indicate the natural partitions of the program. Every partitioned segment can be converted into an independent program by finding input and output data structures of the partitioned segment.
  • the repetitive segments may be further parallelized using the SIMD principle. For example, for an iterative segment, independent loop instances are considered duplicable on all CLM processors. Then a data vectorization is performed to enable the simultaneous activation of the same loop core on as many CLM processors as possible, i.e., dynamic SIMD processors.
  • a recursive state tuple is constructed to force the duplicated kernels to perform a breadth-first search over the implied recursion tree.
  • the CLM compiler 555 operates in the following phases: Data Structure Initialization, Blocking, Tagging,
  • Data structure initialization: if the input specification is syntactically correct, the data structure initialization phase builds an internal data structure to store all information necessary to preserve the semantics of the sequential program, including:
  • a linear list of statements for the main program body is the primary output of the data structure initialization phase. In the list, all procedures and functions have been substituted in line.
  • Blocking: the blocking phase scans the linear statements and blocks the linear statements into segments according to the following criterion:
  • L-Block Vectorization: the L-block vectorization phase analyzes the duplicability of each L-block, and performs the following tasks:
  • the L-Block vectorization phase may optionally attempt to order the number of
  • v. Calculate the computational density D of the L block by counting the number of fixed and floating point operations, including the inner loops. If D is less than a threshold specified by the -D option at the
  • Starting from the outermost loop, mark the Vth loop as an SIMD candidate, with V specified at
  • R-Block Vectorization: the R-block vectorization phase is responsible for analyzing the duplicability of recursive functions and procedures by performing the
  • S-Block processing is performed for enhancing the efficiency of the potential processor pipelines if the intended program is to operate on a continuous stream of data, as specified by the presence of the -P option at compilation command.
  • Each S-block is inspected for the total number of statements contained in the S-block, and all S-blocks are to be spliced into segments less than or equal to the value given by the -P option.
  • ii. Collect all data structures returned by each block.
  • the returned data structures include all global data structures referenced on the left-hand side of all statements in the block and the modified parameters returned from function/procedure calls;
  • Formulate tuples: the tuple formulation process starts from the output of each block. For each block, the optimal number of output tuples equals the number of sequentially dependent program segments, thereby preserving the original parallelism by retaining the
  • Every output tuple is assigned a tuple name.
  • the tuple name consists of the sequential program name, the block name, and the first variable name in the collection of output variables.
  • the assigned names are to be propagated to the inputs of all related program segments;
  • For example, where a global variable X is modified by three blocks, the updates are serialized: X_1 will be EXCLUSIVE-READ by S2, which returns the updated value in X_2, and the similar process repeats for the remaining blocks.
  • FIGS. 11A and 11B assume the same definition as X.
  • For SIMD execution, each L-block is further spliced into two sections: a master section and a worker section.
  • the master section contains statements for scattering the vectorized data tuples and for gathering the resulting data tuples.
  • the globally referenced data structures in the worker section are structured as READ tuples in the worker code.
  • the vectorized tuples are scattered in G groups, where G is calculated according to the following loop scheduling algorithm, developed based on "Factoring: A Method for Scheduling Parallel Loops," by S.F. Hummel, E. Schonberg and L.E. Flynn, Communications of the ACM, Vol. 35, No. 8, pp. 90-101, August 1992.
  • the disclosed loop scheduling algorithm was developed based on the algorithm published in "A Virtual Bus Architecture for Dynamic Parallel
  • K.C. Lee discussed a modular time-space-time network comprising multiple time division networks connected by a nonblocking space division network (switches). Except for the exclusive-read implementation, the rest is compatible with the multiple-backbone CLM architecture.
  • the tuple size G(Sj's) is defined as follows:
  • the value of P can be automatically determined by the CLM backbone management layer and transmitted to the program at execution time.
  • the value T is from the -T option at the compilation command line.
  • the worker section is padded with statements for retrieving tuples in the beginning of the worker section, and with statements for packing the results into new tuples at the end of the worker section.
  • the new spliced segments are distinguished by suffixes M and W respectively.
  • the master section performs the reading and getting 200 all the
  • the worker section performs the reading 310 of global data tuples
  • vectorized data 335 (T,Sj,#) to the result tuple; putting 350 the result tuple to backbone; checking 355 if Sj>T then loop back to get a vectorized tuple 320; otherwise ending 360.
  • each duplicable R-block is spliced into two sections: the master section and a worker section.
  • the master section contains statements for generating an initial state tuple and a live tuple count TP_CNT, with TP_CNT initially set to 1.
  • the initial state tuple is a collection of all data structures required for subsequent computation.
  • the master section also has statements for collecting the results. As shown in FIG. 14, the master section operates using the procedure of assembling 365 a state tuple from the collection of all read-only dependent data structures in the R-block, with the state tuple
  • ST_TUPLE(G, d1, d2, ..., dk) where G is the grain size given at compilation time using the -G option.
  • the master section also creates a result tuple using the globally updated data structures and return data structures via syntactical analysis of the recursion body. It puts the result tuple into the backbone.
  • the master section generates 370 a live tuple count TP_CNT; sets 375 the live tuple count TP_CNT to 1; puts 380 the state tuple and the live tuple count to the CLM backbone 120; gets 385 a term tuple from the CLM backbone 120; gets 390 a result tuple from the CLM backbone 120; unpacks 395 the result tuple; and returns 400 the unpacked results.
  • the worker section gets 405 the state tuple ST_TUPLE from the CLM backbone 120; unpacks 410 the state tuple; calculates 415 using the state tuple;
  • the worker section also updates 425 the result tuple during the calculation via exclusive-reads and writes to the backbone. It then creates 440 N new state tuples according to the results of the calculations. It gets 445 TP_CNT from the backbone and sets it to TP_CNT + N - 1. If TP_CNT becomes 455 zero, then it creates 460 a "term" tuple and puts it into the backbone. Otherwise, it puts 475 N new state tuples into the backbone and loops back to the beginning. (A code sketch of this live-tuple-count protocol appears after this list.)
  • EXCLUSIVE-READ deadlock prevention is performed by processing the multiple-EXCLUSIVE-READ blocks.
  • A block with K exclusive-read input tuples, K > 1, implements the protocol shown in FIG. 16, where a count CNT is set 500 to equal 1; input tuples are read 505 from the CLM backbone 120; and a check 510 is made whether the input tuple is to be read, as opposed to being exclusively read. If the input tuple is to be read, the input tuple is read 515. Otherwise, check 520 whether the input tuple is to be exclusively read and whether the count CNT is less than K. If the input tuple is to be exclusively read and the count CNT is less than K, then the input tuple is read 525 and the count is incremented by setting 530 CNT equal to CNT + 1.
  • If the input tuple is to be exclusively read but the count CNT is greater than or equal to K, the input tuple is exclusively read 535.
  • the procedure illustrated in FIG. 16 prevents possible deadlocks when K exclusive tuples are acquired by L different program segments on L computers, with L > 1, with the deadlock resulting in no exclusive tuples progressing. (A code sketch of this counting rule also appears after this list.)
  • Map generation: for each program segment, generate the initial input tuple names and mark the initial input tuple names with RD to be read, or ERD to be
  • the initial input tuples in the map should NOT contain the tuples belonging to the gathering parts, such as 245 and 285 in FIG. 12 and 385 and 390 in FIG. 14. These tuples are to be enabled after the completion of the scattering part.
  • the CLM O/S extension 560 contains the following elements:
  • an Event Scheduler 585, which is a program using CLM extended TCP/IP to interpret the network traffic.
  • the Event Scheduler 585 branches into four different servers: a Program Management Server (PMS), a Data Token Server (DTS), a General Process Management Server (GPMS), and a Conventional Operating System Interface (COSI).
  • Program Management Server (PMS): the program storage and removal functions act as a simple interface with existing O/S file systems. After receiving an activation command or a RUN command for an application program, the PMS builds a memory image for every related segment:
  • the segment image can contain only the Trigger Tuple Name Table and The Disk Address entry. Similar to a demand paging concept, the segment with a matching tuple is fetched from local disk to local memory.
  • the trigger tuple name table size is adjustable at system generation time. The PMS also creates an indexed table (MTBL) from DTPS_TBL to the newly created segment images according to the referenced tuple names.
  • Data Token Server (DTS).
  • General Process Management Server (GPMS) 620 for managing the KILL, SUSPEND, and RESUME commands, as well as general process status (GPS) commands for the CLM processes.
  • Conventional Operating System Interface (COSI).
  • the present invention uses a method for executing an application program including a plurality of program segments.
  • the method includes the steps of receiving 1040 an execution command; transmitting 1045, on the CLM backbone 120, the labeled program segments to the plurality of computers; loading 1050 the labeled program segments into each computer of the plurality of computers; transmitting 1055, on the CLM backbone 120, the labeled data tuples to the plurality of computers; receiving 1060, at a receiving computer, a labeled data tuple; activating 1065, at the receiving computer, in response to the labeled data tuple, the program segments corresponding to the labeled data tuple; processing 1070, at the receiving computer, the activated program segments; transmitting 1075, on the CLM backbone 120, the results of the processing of the program segments as labeled data tuples; continuing 1080 to
  • FIG. 19 illustrates CLM Processor and Single Backbone
  • the S bit indicates the availability of register to the
  • the Shift_Clock controls the backbone rotating frequency.
  • the shifting register can typically hold 1024 bytes of information.
  • a) Purge of returned messages. This protocol checks the register content against BUFFER_OUT. If there is a match and if the message is not EXCLUSIVE-READ or the message is not the last packet in an EXCLUSIVE-READ message, then the content in BUFFER_OUT is purged. This empties the message slots. A returned EXCLUSIVE-READ message (or the last packet of the message) will keep circulating until it is consumed. b) Tuple name matching.
  • a data tuple in the register contains an index generated by the compiler. This index is also recorded in the DTPS_TBL and local TTNTs. A simple test using the index in the local TTNTs can determine the
  • c) EXCLUSIVE-READ deadlock avoidance. When more than one CPU exclusively reads one of many tuples belonging to the same program segment, or more than one CPU exclusively reads some of the packets belonging to the same tuple, none of the CPUs will ever be completely matched. The deadlock is avoided by exclusively reading only the last EXCLUSIVE-READ tuple or the last packet of an EXCLUSIVE-READ tuple in a TTNT.
  • FIG. 20 illustrates a CLM with three processors and a single backbone. In this structure, point-to-point,
  • broadcast and exclusive-read of any message can be performed by any processor.
  • When the BUFFER_OUTs on all processors are full, the system enters a "lock state".
  • the 'lock state' may be automatically unlocked when some of the CPUs become available for computing.
  • the system enters a "deadlock state" when all CPUs are blocked for output and no empty slot is available on the backbone.
  • FIG. 21 illustrates the CPU-to-backbone interface for a CLM system with two backbones. Both READ and WRITE
  • the backbone initialization command sets all message (register) headers to 0.
  • processors are not guaranteed to share the backbone in a "fair" fashion: the closer neighbors of the sender will be busier than those farther away. In general, this should not affect the overall CLM performance since, when all the closer neighbors are busy, the farther neighbors will eventually get something to do.
  • The protocol disclosed in Patent Filing Number 08/029,882, A MEDIUM ACCESS CONTROL PROTOCOL FOR SINGLE-BUS MULTIMEDIA FAIR ACCESS LOCAL AREA NETWORKS, by Zheng Liu, can give both the fairness property and the multi-media capabilities.
  • the present invention also demonstrates a feasible design of a single backbone and a multiple-backbone CLM system.
  • the disclosed protocols illustrate the principles for implementing:
  • the present invention automatically partitions any sequential program into program segments and uses a method for executing an application program including a plurality of inter-relating program segments, with each program segment labeled according to data dependencies and
  • the method includes the steps of receiving 1040 an execution command; transmitting 1045, on the CLM backbone 120, the labeled program segments to the plurality of computers;
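The live-tuple-count protocol used by duplicable R-blocks (FIGS. 14 and 15 above) can be summarized with a short code sketch. The following C fragment is illustrative only: the primitives clm_exclusive_get and clm_put and the helper expand_state are assumed names standing in for the CLM operating system extension calls and for the application-specific recursion body; they are not defined by the patent.

#include <stddef.h>

/* Assumed CLM primitives (hypothetical names, not from the patent). */
extern int  clm_exclusive_get(const char *name, void *buf, size_t len); /* EXCLUSIVE-READ */
extern void clm_put(const char *name, const void *buf, size_t len);     /* write to backbone */

#define MAX_EXPANSION 16

typedef struct {        /* ST_TUPLE(G, d1, ..., dk): grain size plus packed state data */
    int  grain;
    char data[256];
} StateTuple;

/* Application-specific step: reduce one state tuple, update the result tuple
 * on the backbone (step 425), and emit up to max_out new state tuples (440). */
extern int expand_state(const StateTuple *st, StateTuple out[], int max_out);

void r_block_worker(void)
{
    StateTuple st;
    while (clm_exclusive_get("ST_TUPLE", &st, sizeof st)) {      /* get 405 */
        StateTuple new_st[MAX_EXPANSION];
        int n_new = expand_state(&st, new_st, MAX_EXPANSION);    /* calc 415, create 440 */

        long tp_cnt;
        clm_exclusive_get("TP_CNT", &tp_cnt, sizeof tp_cnt);     /* get 445 */
        tp_cnt += n_new - 1;        /* one tuple consumed, n_new created */
        clm_put("TP_CNT", &tp_cnt, sizeof tp_cnt);

        if (tp_cnt == 0) {
            clm_put("term", NULL, 0);                            /* create 460 "term" tuple */
        } else {
            for (int i = 0; i < n_new; i++)                      /* put 475 new state tuples */
                clm_put("ST_TUPLE", &new_st[i], sizeof new_st[i]);
        }
    }
}

The master section (FIG. 14) simply seeds ST_TUPLE and TP_CNT = 1 and then blocks on the "term" and result tuples.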
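The EXCLUSIVE-READ deadlock prevention rule of FIG. 16 can likewise be sketched in C. The function below is a minimal illustration, assuming hypothetical clm_read and clm_exclusive_read primitives; only the K-th exclusive-read input of a segment is actually consumed with an EXCLUSIVE-READ, while the earlier ones are fetched with plain READs.

#include <stddef.h>

typedef struct {
    const char *name;        /* tuple label                      */
    int         exclusive;   /* nonzero if marked ERD in the map */
} TupleReq;

extern void clm_read(const char *name, void *buf, size_t len);           /* READ           */
extern void clm_exclusive_read(const char *name, void *buf, size_t len); /* EXCLUSIVE-READ */

/* Acquire the n_inputs input tuples of a segment that has k (> 1)
 * exclusive-read inputs, following the counting protocol of FIG. 16. */
void acquire_inputs(const TupleReq req[], void *buf[], const size_t len[],
                    int n_inputs, int k)
{
    int cnt = 1;                                         /* set 500: CNT = 1        */
    for (int i = 0; i < n_inputs; i++) {                 /* read 505 input tuples   */
        if (!req[i].exclusive) {
            clm_read(req[i].name, buf[i], len[i]);       /* plain READ (515)        */
        } else if (cnt < k) {
            clm_read(req[i].name, buf[i], len[i]);       /* read 525, not consumed  */
            cnt = cnt + 1;                               /* set 530: CNT = CNT + 1  */
        } else {
            clm_exclusive_read(req[i].name, buf[i], len[i]); /* last one: 535       */
        }
    }
}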

Abstract

A multicomputer system and method for automatic sequential-to-parallel program partition, scheduling and execution. The plurality of computers (100, 101, 102, 103) are connected via a uni-directional slotted ring (120) (backbone). The ring (120) supports, with deterministic delays, point-to-point, broadcast and EXCLUSIVE-READ operations over labeled tuples. The execution of parallel programs uses a connectionless computing method that forms customized coarse grain SIMD, MIMD and pipelined processors automatically using the connected multicomputers. The disclosed multicomputer architecture is also resilient to processor and backbone failures.

Description

MULTICOMPUTER SYSTEM AND METHOD
BACKGROUND OF THE INVENTION
This invention relates to a system for connecting a plurality of computers to a network for parallel processing.
DESCRIPTION OF THE RELEVANT ART
As computers have become less expensive, the interest in linking multiple computers to effect powerful
distributed and parallel multicomputer systems has
increased. Systems with multiple processors may be divided into two categories: systems with physically shared memory, called multi-processors, and systems without physically shared memory, called multi-computers.
For single computers, the execution mode for sequential machines is single instruction, single data (SISD)
computing. As illustrated in FIG. 1, a SISD computer operates a single instruction, I, on a single datum, D, one at a time, in an arrangement commonly known as the von
Neumann architecture.
For parallel processing, Flynn's Taxonomy classifies three methods of computing. The first method includes single instruction, multiple data (SIMD) computing, known as vector processing. As shown in FIG. 2, with SIMD computing, processors are loaded with the same set of instructions, but each processor operates on a different set of data. SIMD computing has each processor calculating, for a set of data, D1, D2, D3, D4, using the same instruction set I, in
parallel.
The second method of parallel computing is multiple instruction, multiple data (MIMD) computing. MIMD computing has different data processed by different instruction sets, as indicated in FIG. 3. MIMD computing breaks the execution by a parallel software into pieces, thereby providing multiple instruction sets or multiple processors, I1, I2, I3, I4. With MIMD computing, there are multiple sets, D1, D2, D3, D4, and each data set respectively is fed into a
separate processor, I1, I2, I3, I4, respectively. The third method of parallel computing is multiple instruction, single data (MISD) computing, commonly known as a pipeline system, as shown in FIG. 4. In FIG. 4, data D1, D2, D3, D4, go into instruction set I1, I2, I3, I4. By the time data D1 are processed in a first processor, I1, the second data D2 go into the first processor I1, and the first data D1, having been processed by the first processor I1, go into the second processor I2. MISD computing can contribute to the overall efficiency only when there are at least two (2) input instances to the pipe intake with maximal k times speedup where k is the number of pipe stages. SIMD
computing is the most effective approach to parallel
processing since every computing bound program must have at least one repetitive segment which consumes most of the time and may be vectorized.
In D. Gelernter, "GETTING THE JOB DONE", BYTE, Nov.
1988, pp. 301-6, the concept of a tuple space is developed, where data are put into a virtual bag of tuples, as shown in FIG. 5. In the approach employed by Gelernter using a system called Linda, all of the computers of the Linda system are loaded with some computing intensive subprograms, i.e., workers, and the computers access the tuple space looking for work to perform, in a manner similar to SIMD computing.
SUMMARY OF THE INVENTION
A general object of the invention is a multicomputer system and method for parallel execution of application programs.
Another object of the invention is to construct a scalable, fault tolerant and self-scheduling computer architecture for multi-computer systems.
According to the present invention, as embodied and broadly described herein, a multicomputer system and
connectionless computing method are provided for use with a plurality of computers and a backbone. The multicomputer system is thus called connectionless machine (CLM). The CLM backbone has a unidirectional slotted ring structure but can be simulated, with less efficiency, by any interconnection network that provides point-to-point, broadcast and
EXCLUSIVE-READ operations. Conventional networks, such as multi-bus systems and shared memory systems as well as local and wide area networks, can also function as a CLM backbone without the multi-media capability by a software
implementation of a tuple space. The CLM backbone may use most types of medium for interconnecting the plurality of computers.
At least one of the computers, and in general several computers, sends labeled messages, in data packets, over the CLM backbone. The messages may be sets of instructions I1, I2, I3, I4, . . ., or sets of data, D1, D2, D3, D4, . . . At initialization, each computer receives a full set of
partitioned segments belonging to one application program. Each program segment, or a subset of a partitioned program, consumes and produces labeled data messages. A computer having no work to perform is considered to be sitting idle. The first computer encountering a message that matches the requirements of a program segment may exclusively receive that message. The program segment on the first computer is enabled when all required labeled data messages are
received.
When the first computer generates new data, they are sent as new data messages which may activate many other computers. This process continues until all data messages are processed.
Additional objects and advantages of the invention are set forth in part in the description which follows, and in part are obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention also may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate preferred embodiments of the invention, and together with the description serve to explain the principles of the invention.
FIG. 1 illustrates a single instruction, single data machine;
FIG. 2 illustrates a single instruction, multiple data machine;
FIG. 3 illustrates a multiple instruction, multiple data machine;
FIG. 4 illustrates a multiple instruction, single data machine;
FIG. 5 illustrates a plurality of computing machines employing a tuple space;
FIG. 6 illustrates the CLM architecture;
FIG. 7 is a MIMD data flow example;
FIG. 8 is a SIMD component example;
FIG. 9 illustrates a CLM compiler and operating system interface;
FIGS. 10A and 10B illustrate the CLM compilation principle;
FIGS. 11A and 11B illustrate the concept of
serialization of global variable updates;
FIG. 12 shows an L-block master section procedure;
FIG. 13 shows an L-block worker section procedure;
FIG. 14 shows an R-block master section procedure;
FIG. 15 shows an R-block worker section procedure;
FIG. 16 illustrates an exclusive-read deadlock
prevention protocol;
FIG. 17 shows a CLM operating system extension;
FIG. 18 illustrates a method of the present invention for processing and executing application programs;
FIG. 19 shows a conceptual implementation of a CLM processor with a backbone interface;
FIG. 20 illustrates an implementation of CLM with three processors; and
FIG. 21 illustrates a CLM construction with two
parallel backbones.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Reference is now made in detail to the present
preferred embodiments of the invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals indicate like elements throughout the several views.
The multicomputer system and method has a plurality of computers connected to a connectionless machine (CLM) backbone. Each computer connected to the CLM backbone has a local memory for receiving data packets transmitted on the CLM backbone. The plurality of computers seek or contend to receive data packets, and thus receive messages, from the CLM backbone.
The backbone READ and WRITE operations on each CLM computer are independent of the computer's processor
functions. A condition may exist where the CLM backbone becomes saturated with data packets, i.e., all the slots on the CLM backbone are full of data packets. The saturation will be automatically resolved when some computers become less busy. If all computers stay busy for an extended period of time, adjustments to the computer processing powers and local write-buffer sizes can prevent or resolve this condition.
Definitions
A computer is defined as a device having a central processing unit (CPU) or a processor. The computers in a CLM system may be heterogeneous or homogeneous, and each computer has a local memory and optional external storage. Each computer preferably has more than 0.5 million floating point operations per second (MFLOPS) processing power.
A CLM backbone is defined as having a unidirectional slotted ring structure or simulated by any interconnection network that can perform or simulate point-to-point, broadcast and exclusive-read operations. The network can use fiber optical or copper cables, radio or micro waves, infra-red signals or any other type of medium for
communicating.
A message is defined to be a labeled information entity containing one or more data packets. A message may include at least one set of instructions or data. The term 'data tuple' is used to refer to 'labeled message'. The set of instructions operates on at least one set of data.
A data packet is defined to be a limited number of bits containing useful information and a label for program segment matching purpose. A data packet is the basic building block for transmitting messages through the CLM backbone.
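As an illustration of these definitions, a labeled data packet can be pictured as a small fixed-size record; the field names and sizes below are assumptions for the sketch only, since the patent does not fix a packet format.

#include <stdint.h>

#define CLM_PACKET_PAYLOAD 1024    /* assumed bound; the patent only requires a limited number of bits */

/* One data packet: the basic building block transmitted on the CLM backbone.
 * The label carries the tuple name used for program-segment matching, so a
 * receiving computer can decide whether to take the packet without
 * inspecting the payload. */
typedef struct {
    char     label[32];            /* tuple (labeled message) name               */
    uint32_t seq;                  /* position of this packet within the message */
    uint32_t total;                /* number of packets making up the message    */
    uint16_t length;               /* bytes of payload actually used             */
    uint8_t  payload[CLM_PACKET_PAYLOAD];
} ClmPacket;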
A broadcast is defined to be a communication
originating from a network device, to all the other network devices. A network device typically includes a computer; however, any device in general may be connected to the CLM backbone.
A multicast is defined to be a communication
originating from a network device, to a group of selected network devices with specially arranged addresses.
Throughout the following description of the invention, the term "broadcast" is used for identifying both broadcasts and multicasts unless indicated differently.
CLM Multicomputer Architecture and Method
The CLM multicomputer system is a multicomputer
processor architecture for providing a solution to problems in multi-processor (multi-CPU) systems in scheduling, programmability, scalability and fault tolerance. The following description is of the CLM architecture, operating principle and the designs for a CLM compiler and an
operating system extension.
In the exemplary arrangement shown in FIG. 6, a multicomputer system and method are provided for use with a plurality of computers, and a CLM backbone 120.
The CLM multicomputer system uses the CLM backbone 120 as a medium for transmitting the plurality of labeled data packets corresponding to the message, for transmitting processed data packets, and for transmitting EXCLUSIVE-READ signals. The CLM backbone 120 may be a local area network or a wide area network, with the CLM backbone 120 using any type of medium for interconnecting the computers. The medium may be cable, fiber optics, parallel wires, radio waves, etc.
The CLM architecture is based on a parallel computing model called a Rotating Parallel Random Access Machine (RPRAM). The common bus for connecting parallel processors and memories in conventional Parallel Random Access Machine (PRAM) systems is replaced by a high speed rotating
unidirectional slotted ring, i.e. the CLM backbone 120, in the RPRAM.
A CLM system operates according to the dataflow principle; i.e. every application program is represented as a set of labeled program segments, or a subset of an application program, according to the natural data
dependencies and duplicabilities of the program segments. The only connections between the program segments are matching labeled data tuples. A CLM RUN command loads program segments to all computers. The sequential
processing requirements of the original program are
preserved by discriminating the EXCLUSIVE-READ operation from a READ operation. The CLM backbone 120 may transmit data tuples, embodied as labeled data packets, as initial data tuples of the application program, in the backbone. A computer completing the processing of a data tuple returns the results of the processing into the CLM backbone 120 in the form of new data tuples. A new set of computers may be triggered by the new data tuples to process the new data tuples, and the processing of initial and new data tuples continues until the application program is completed. The following discussion refers to program segments, with the understanding to one skilled in the art that the principles taught herein apply as well to a subset of a partitioned program.
The CLM multicomputer system executes an application program including a plurality of program segments, with each program segment labeled according to data dependencies or parallelism among the program segments, and with each program segment connected to other program segments by a corresponding labeled data tuple. The CLM backbone 120, responsive to an execution command, transmits the labeled program segments to the plurality of computers to load the labeled program segments into each computer of the plurality of computers. The CLM backbone 120 also circulates the labeled data tuples through the plurality of computers. At least one of the plurality of computers, when it receives a matching set of data tuples, activates the program segment corresponding to the labeled data tuples, processes the activated program segments, and transmits on the CLM
backbone 120 the results of the processing of the program segment as new labeled data tuples. The CLM backbone 120 continues to transmit the labeled data tuples to the
plurality of computers and the plurality of computers continues to process the program segments corresponding to the labeled data tuples until the entire application program is processed.
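The execution model just described can be summarized as a per-computer loop. The sketch below is a simplified C illustration, not the patent's implementation: the clm_* calls are assumed names for the operating system extension services, and the matching shown as a function call is performed in the real system through the compiler-generated DTPS_TBL.

#include <stddef.h>

typedef struct Tuple   Tuple;     /* opaque labeled data tuple                 */
typedef struct Segment Segment;   /* one partitioned, labeled program segment  */

extern Tuple   *clm_next_tuple(void);                  /* READ/EXCLUSIVE-READ from the ring  */
extern Segment *clm_match_segment(const Tuple *t);     /* lookup via DTPS_TBL (assumed API)  */
extern void     clm_bind_input(Segment *s, Tuple *t);
extern int      clm_inputs_complete(const Segment *s); /* all required tuples received?      */
extern void     clm_run_segment(Segment *s);           /* results written back as new tuples */
extern int      clm_application_done(void);

void clm_compute_loop(void)
{
    while (!clm_application_done()) {
        Tuple *t = clm_next_tuple();        /* take a circulating labeled tuple, if any */
        if (t == NULL)
            continue;

        Segment *s = clm_match_segment(t);  /* which local segment consumes this label? */
        if (s == NULL)
            continue;                       /* not ours: the tuple keeps circulating    */

        clm_bind_input(s, t);
        if (clm_inputs_complete(s))         /* segment enabled: matching set complete   */
            clm_run_segment(s);             /* processing emits new labeled data tuples */
    }
}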
Using the dataflow principle, a set of global data structures are read by processors to prepare for the
execution of a subsequent mix of coarse grain SIMD, MIMD and MISD components.
A coarse grain processor vector or SIMD forms when a number of computers are activated upon the reading of vectorized data tuples, tuples of the same prefix, thereby triggering the same program segment existing on these computers. A coarse grain MIMD processor forms when
different program segments are activated at the same time upon the reading of different data tuples. The output tuples of every computer automatically chain the acquiring computers into processor pipelines. If the application operates on a continuous flow of a data stream, the
processor pipelines stabilize to provide an additional acceleration for the inherently sequential parts of the application.
FIG. 7 illustrates the formation of a coarse grain MIMD processor. In FIG. 7, with computers 100, 101, 102 coupled to the CLM backbone 120, computers 100, 101 execute program segments S1 and S2, respectively, in parallel as MIMD, followed by computer 102 executing segment S3. Note that all program segments in this example use the EXCLUSIVE-READ operation since the original dataflow does not have the sharing of data elements.
FIG. 8 illustrates the formation of a coarse grain SIMD processor. In FIG. 8, with computers 100, 101, 102 coupled to the CLM backbone 120, computers 100, 101, 102 execute program segments S1 and S2 in parallel as SIMD. Computer 100 exclusively reads tuple a, so segment S1 is executed only once on 100. Segment S1 generates 1000 vectorized tuples labeled as 'd1' to 'd1000', and triggers at least three computers (waiting for 'd*'s) to run in parallel in an extended version of conventional vector processors; i.e. an SIMD component. S3 may only be activated in one of the three computers due to EXCLUSIVE-READS.
Similarly, a set of sequentially dependent program segments executed by the computers of the CLM system may form a coarse grain processor pipeline. If the data tuples continue to be input, the pipeline may contribute to a potential speedup of computations.
Multiple parallel segments may execute at the same time and compete for available data tuples. Multiple backbone segments can transmit simultaneously. Labeled tuples may circulate in the backbone until either consumed or recycled.
The unidirectional high speed CLM backbone 120 fuses the active data tuples from local memories into a rotating global memory. The CLM backbone 120 circulates only active data tuples required by the processing elements at all times. Thus, if the overall processing power is balanced with the backbone capacity, the worst tuple matching time is the maximum latency of the backbone.
Using the system and method of the present invention, the multicomputer scheduling problem is transformed into a data dependency analysis problem and an optimal data granule size problem at the expense of redundant local disk space or memory space.
The parallelization of MIMD and pipeline segments is a natural product of the program dependency analysis. The parallelization of SIMD segments requires dependency tests on all repetitive entities (loops or recursive calls). For example, any upwardly independent loop is a candidate for parallelization. All candidates are ranked by the
respective computing density of each candidate, with the highest priority being assigned to a unique loop being both dense and independent. If the number of CLM processors is greater than the number of iterations of the highest prioritized loop, then a next priority loop may be
parallelized; otherwise, the inner loops need not be
parallelized. The parallelized loop uses a variable data granule size. Its adjustments affect the overall computing versus communications ratio in this loop's parallelization. A heuristic algorithm has been developed to adjust this size dynamically to deliver reasonably good performance. The best ratio gives an optimal performance. Recursive
functions and procedures are parallelized using linear subtrees with a uniformly controlled height. Adjusting the height can affect the computing and communication density; thus improving the balance of computing loads. In the above described manner, any sequential program may automatically distribute and parallelize onto the CLM architecture.
The CLM compilation is independent of the number of computers in a CLM system. A CLM compiler generates codes for execution regardless of the number of computers.
Traditional parallel compilation difficulties, i.e. optimal partition, processor allocation, and vector and pipeline
constructions, are decreased by letting active data tuples form the most effective pipelines and vectors automatically through the EXCLUSIVE-READ protocol over the CLM backbone, thus greatly improving the scheduling efficiency and
programmability of multi-computer systems.
Since each computer processes at most one program segment at a time, a malfunction of one computer damages at most the work being done by one program segment, with the damage being reflected by lost data tuples. Detection of the lost data tuples by using time-outs and by re-issuing respective data tuples can restart the lost computation on a different processor to provide fault tolerance.
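A minimal sketch of this time-out based recovery follows, assuming a hypothetical bookkeeping structure kept by the issuing computer and an assumed re-issue call; the patent describes the principle but not this code.

#include <time.h>

/* Hypothetical record kept for each outstanding tuple when fault tolerance
 * code generation is enabled. */
typedef struct {
    char   name[64];      /* label of the tuple that was written             */
    time_t issued_at;     /* when it was put on the backbone                 */
    int    outstanding;   /* nonzero until the matching result tuple arrives */
} PendingTuple;

extern void clm_put_by_name(const char *name);   /* re-issue a stored copy (assumed API) */

/* Re-issue any tuple whose result has not appeared within timeout_sec.
 * A different, idle processor will then pick it up and redo the lost work. */
void clm_recover_lost(PendingTuple pend[], int n, double timeout_sec)
{
    time_t now = time(NULL);
    for (int i = 0; i < n; i++) {
        if (pend[i].outstanding && difftime(now, pend[i].issued_at) > timeout_sec) {
            clm_put_by_name(pend[i].name);   /* restart the lost computation elsewhere */
            pend[i].issued_at = now;
        }
    }
}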
The best possible performance of a CLM system is determined by the proper adjustment of the scheduling factor value (0<f<=1) for vectorizable computing intensive segments.
An application's natural structure also has an impact on performance. For example, an application having many duplicably decomposable segments performs better than an application having few duplicably decomposable segments. Applications with a wider range of data granule sizes would perform better than applications with narrow data granule ranges. The impact is most obvious when the CLM processor powers vary widely.
With the best possible program decomposition, the CLM backbone capacity and the aggregate computer power also limit the maximum performance deliverable by a CLM system for an application. The theoretical power limit of a CLM may be modeled by the following equation:
P = min { ∑ Pi , CD · min { DD , K · R } }
where:
P is the CLM theoretical performance limit in MFLOPS; CD is the average computation density of an application in floating point operations per byte (FLOPB); DD is the average data density of an application; i.e. total input, output, and intermediate data in millions of bytes preferably transmitted in one time unit (second);
K is the number of parallel rings, with K = 1, 2, ... 64;
R is the CLM ring capacity in MBPS; and
Pi is the individual CLM processor power in millions of floating point operations per second (MFLOPS).
For example, in a CLM system having a single 10 MBPS ring (R = 10, K = 1) and having 10 processors with an average Pi = 20 MFLOPS, a computation with CD = 100 FLOPB and DD = 1 MBPS has a
P = min {20×10, 100×min{1, 1×10}} = 100 MFLOPS performance, with the performance being limited by the product of the computation and data densities rather than by the processors. For a different application, if CD = 1000 FLOPB, this CLM does not deliver 10 times more MFLOPS for the computation; it can deliver at most
P = min{20×10, 1000 × min{1, 1×10}} = 200 MFLOPS. Here the processors are the bottleneck of the computation.
Similarly, if DD is greater than 10 MBPS, then the ring becomes the bottleneck of the computation.
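To make the model concrete, the following C fragment computes the limit directly from the equation above; the function name and argument list are illustrative assumptions rather than part of the disclosed system:

#include <stdio.h>

/* Theoretical CLM performance limit (MFLOPS) per the model above.
 * p_sum : aggregate processor power, sum of the Pi (MFLOPS)
 * cd    : average computation density (FLOPB)
 * dd    : average data density (millions of bytes per second)
 * k     : number of parallel rings
 * r     : ring capacity (MBPS)
 */
static double clm_limit(double p_sum, double cd, double dd, int k, double r)
{
    double comm  = (dd < k * r) ? dd : k * r;   /* min{DD, K*R}        */
    double bound = cd * comm;                   /* CD * min{DD, K*R}   */
    return (p_sum < bound) ? p_sum : bound;     /* min{sum Pi, bound}  */
}

int main(void)
{
    /* The two examples from the text: 10 processors of 20 MFLOPS, one 10 MBPS ring. */
    printf("%.0f MFLOPS\n", clm_limit(10 * 20.0, 100.0, 1.0, 1, 10.0));   /* 100 */
    printf("%.0f MFLOPS\n", clm_limit(10 * 20.0, 1000.0, 1.0, 1, 10.0));  /* 200 */
    return 0;
}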
With a 5 μs per kilometer delay and a 147 meter
interval between computers, a distributed CLM system may include up to 10,000 computers with a maximum round trip latency of less than 8 milliseconds (10,000 × 147 m ≈ 1,470 km, which at 5 μs per kilometer is about 7.4 ms per traversal of the ring). With a 100 ns per-computer delay, a 32,000-computer centralized CLM system has a 3.2 millisecond maximum round trip latency. Any application requiring more than 10 seconds of computing time can benefit from either CLM.
The CLM software architecture includes a compiler and an operating system extension, and, in the preferred
embodiment, every computer of the CLM system runs a multiprogrammed operating system, such as Unix(tm), VMS(tm) and OS2(tm). The CLM compiler analyzes the data dependency and the embedded parallelism, i.e., the sub-program duplicability, of a given sequential program. The CLM compiler generates a set of partitioned program segments and a program-segment-to-data-tuple mapping. The CLM operating system extension uses the mapping to provide a fast, searchless tuple matching mechanism.
Abstractly, a CLM system is a data reduction machine that uses parallel processors to reduce the natural dataflow graph of a computer application. The embedded parallelism of the application is defined by the CLM compiler. The CLM architecture of the present invention trades space redundancy for scheduling efficiency, programmability and fault tolerance.
CLM Compiler
The CLM compiler is a sequential program re-structurer; i.e. the CLM compiler takes a sequential program description as an input in a specific programming language; for example, the C programming language. In use, the CLM compiler generates a set of distributable independent program
segments in a desired language; and generates a data-tuple-to-program-segment matching table (DTPS_TBL). The
distributable program segments and the DTPS_TBL are
transmitted to all participating computers of the CLM system before the execution of the application program. The local CLM operating system extensions load the DTPS_TBL, and prepare each computer for a subsequent RUN command.
FIG. 9 illustrates the relationship between the CLM compiler 555 and the CLM operating system (O/S) extension 560.
The CLM 'run X' command sends application X's data tuples to all CLM processors through the O/S extension 560. The compile-link 580 means that the CLM compiler can
optionally load the compiled programs 570-575 and DTPS_TBL onto CLM processors 570 immediately after compilation, or leave the loading to be done by the RUN command.
A Force Copy (FC) directive is introduced to guide the CLM compiler to reduce the compilation time by skipping the dependency analysis. The FC directive has the following syntactical form:
FC { STATEMENT }
For example,
FC { WHILE (!(X < Y))
       {
           X = SIN(Y) * 250.33;
           ...
       }
   }
Using C as the host programming language, the CLM compilation command at the operating system level has the following syntax:
CLCC SOURCE_PROGRAM_NAME
(-D DATA DENSITY THRESHOLD)
(-G DATA GRAIN SIZE)
(-P PIPELINE STAGE SIZE)
(-F FACTORING VALUE, 0<F<=1, DEFAULT 0.5)
(-T FACTORING THRESHOLD, MANDATORY IF F>0)
(-R RECEIVE TIME OUT VALUE IN SECONDS FOR RUNTIME FAULT DETECTION)
(-V DEPTH OF VECTORIZATION, WITH DEFAULT = 1)
The -D option designates a data density threshold for determining the worthiness of parallelization of repetitive program segments. The -G option designates a grain size value for optimizing the difference. The -P option
designates a pipeline stage size to guide the CLM compiler
555 in generating the sequential program segments to improve the performance if the intended program operates on a continuous stream of data. The -R option activates fault tolerance code generation; it may impact the overall performance due to the extra overhead introduced in fault detection and recovery. The time-out value of -R may range from microseconds to 60 seconds. The -V option controls the depth of vectorization. A greater value of V may be used to obtain better performance when a large number of CLM processors are employed. The -F option defines the factoring percentage used in a heuristic tuple size
estimation algorithm. The default value is 0.5. CLM compiler commands for other programming languages may have similar syntax structure.
FIGS. 10A and 10B illustrate the CLM compiler operating principle. Every sequentially composed program, regardless of the number of sub-programs, can be transformed into one linear sequence of single instructions. The repetitive segments, iterative or recursive, indicate the natural partitions of the program. Every partitioned segment can be converted into an independent program by finding the input and output data structures of the partitioned segment. The repetitive segments may be further parallelized using the SIMD principle. For example, for an iterative segment, independent loop instances are considered duplicable on all CLM processors. A data vectorization is then performed to enable the simultaneous activation of the same loop core on as many CLM processors as possible, i.e., dynamic SIMD processors. For a recursive segment, a recursive state tuple is constructed to force the duplicated kernels to perform a breadth-first search over the implied recursion tree.
In operation, the CLM compiler 555 operates in the following phases: Data Structure Initialization, Blocking, Tagging,
L-block Vectorization, R-block Vectorization, and Code
Generation. 1. Data structure initialization: if the input specification is syntactically correct, the data structure initialization phase builds an internal data structure to store all information necessary to preserve the semantics of the sequential program, including:
a. data structures and types;
b. procedures;
c. functions; and
d. sequential statements.
A linear list of statements for the main program body is the primary output of the data structure initialization phase. In the list, all procedures and functions have been substituted in-line.
2. Blocking: the blocking phase scans the linear statements and blocks the linear statements into segments according to the following criterion:
a. All consecutively sequentially connected statements are grouped as one segment, including all
assignments, conditional statements, and non-recursive procedures and functions.
b. All statements included in a loop statement are grouped as one segment. Only the out-most loop is considered for grouping, and all inner-loops are considered as simple statements in the blocking phase.
c. All statements included in a recursive function are grouped as one segment.
d. All statements included in a recursive procedure are grouped as one segment.
3. Tagging: the tagging phase analyzes and tags each segment accordingly:
S - for sequential segments;
L - for loop segments; and
R - for recursive functions or procedures.
4. L-Block Vectorization: the L-Block vectorization phase analyzes the duplicability of each L block, and performs the following tasks:
a. Identify all nested loops within each L block and label each nested loop.
b. Loop maximization: the L-Block vectorization phase may optionally attempt to order the number of
iterations of all loops to form a descending sequence by flipping the independent loops.
c. Perform the following tasks for each loop, beginning from an outer to an inner loop:
i. If a loop contains file input/output statements, mark the loop as non-duplicable and exit the current task;
ii. If a loop is a WHILE loop, translate the
WHILE loop into a FOR loop, if possible. Otherwise, if the loop is not a FC block, mark the loop as non-duplicable and exit the current task;
iii. If either lower or upper bound of the loop is a variable and if the loop is not a FC block, then mark the loop as non-duplicable and exit the current task;
iv. Conduct a dependency test, and mark the loop as non-duplicable if at least one statement in the loop body depends on the execution results of a previous iteration. Anti-dependency is considered duplicable;
v. Calculate the computational density D of the L block by counting the number of fixed and floating point operations, including the inner loops. If D is less than a threshold specified by the -D option at the
compilation command line, and if the block is not a FC block, then mark the L block as non-duplicable and exit the current task; and
vi. Mark the L-block duplicable and exit the current task.
d. Starting from the outermost loop, mark the Vth loop as an SIMD candidate, with V specified at
compilation time and with a default = 1.
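As a concrete illustration of the dependency test in step c.iv, consider the two hypothetical C loops below (not taken from the disclosure): the first has independent iterations and would be marked duplicable; the second carries a dependence on the result of the previous iteration and would be marked non-duplicable unless forced with an FC block.

#include <stddef.h>

/* Duplicable: each iteration writes only c[i] from a[i] and b[i],
 * so iterations may be distributed across CLM processors. */
void duplicable(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Non-duplicable: iteration i reads s[i-1], produced by the previous
 * iteration (a loop-carried dependence), so the compiler marks the loop
 * non-duplicable unless it appears inside an FC block. */
void non_duplicable(const double *a, double *s, size_t n)
{
    for (size_t i = 1; i < n; i++)
        s[i] = s[i - 1] + a[i];
}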
5. R-Block Vectorization: the R-block vectorization phase is responsible for analyzing the duplicability of recursive functions and procedures by performing the
following tasks:
a. Identify all nested procedures and functions and label the nested procedures and functions.
b. Perform the following task for each nested recursive procedure or function:
i. If at least one of the call parameters is updated in the function or procedure body, then mark the nested procedure/function as non-duplicable and exit the current task;
ii. Calculate the computation density D of the R-block by adding the number of fixed/float point operations per recursive execution. If D is less than a threshold specified by the -D option at the compilation command line, then mark the R-block as non-duplicable and exit the current task; and
iii. Mark the R-block duplicable and exit the current task.
6. Code Generation: the code generation phase
transforms the decomposable program segments into
independently compilable programs:
a. S-Block processing is performed for enhancing the efficiency of the potential processor pipelines if the intended program is to operate on a continuous stream of data, as specified by the presence of the -P option at the compilation command. Each S-block is inspected for the total number of statements contained in the S-block, and all S-blocks are split into segments whose sizes are less than or equal to the value given by the -P option.
b. Formalize the tuple requirements for each S/L/R block by the following steps:
i. Collect all data structures required by the statements of each block, including all global data structures and parameters if the block is a
function/procedure block;
ii. Collect all data structures returned by each block. The returned data structures include all global data structures referred to on the left-hand side of all statements in the block and the modified parameters returned from function/procedure calls;
iii. Formulate tuples: the tuple formulation process starts from the output of each block. For each block, the optimal number of output tuples equals the number of sequentially dependent program segments, thereby preserving the original parallelism by retaining the
separation of data tuples;
iv. Assign names to all program segments, with the assigned names being the sequential program name plus the block type plus a block sequence number.
v. Assign tuple names: every output tuple is assigned a tuple name. The tuple name consists of the sequential program name, the block name, and the first variable name in the collection of output variables. The assigned names are to be propagated to the inputs of all related program segments;
vi. Exclusive-read treatment: for every data tuple read by each block,
(a) If the data tuple is to be read by at least one other block, and never modified by any block, mark the reading of the data tuple as non-exclusive;
(b) If the data tuple is to be read only by one block, mark the reading of the tuple as exclusive;
(c) If the data tuple is to be read and modified by many blocks, serialize the updates. For example, suppose variable X is modified by three blocks S1, S2 and S3, and <S1,S2,S3> is the original sequential order. Then introduce new variables X_1, X_2 and X_3. X is EXCLUSIVE-READ by S1 and the modification is stored in X_1; X_1 is EXCLUSIVE-READ by S2 with the updated value stored in X_2; the same process applies to S3, producing X_3. X_1, X_2 and X_3 assume the same definition as X. FIGS. 11A and 11B illustrate tuple update serialization involving multiple blocks.
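A minimal C sketch of this serialization, assuming a hypothetical variable X updated by three successive segments (the function names S1, S2 and S3 are illustrative only):

/* Before: three sequential segments all update the shared variable X.
 * After : the compiler renames each update so that each segment
 *         EXCLUSIVE-READs its predecessor's version and writes a new
 *         tuple, preserving the original order X -> X_1 -> X_2 -> X_3. */

double S1(double X)   { return X + 1.0;  }   /* reads X,   produces X_1 */
double S2(double X_1) { return X_1 * 2.0; }  /* reads X_1, produces X_2 */
double S3(double X_2) { return X_2 - 3.0; }  /* reads X_2, produces X_3 */

double run_serialized(double X)
{
    double X_1 = S1(X);
    double X_2 = S2(X_1);
    double X_3 = S3(X_2);
    return X_3;          /* X_3 assumes the same definition as X */
}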
c. L-block slicing: each SIMD-candidate L-block is further sliced into two sections: a master section and a worker section. The master section contains statements for scattering the vectorized data tuples and for gathering the resulting data tuples. The globally referenced data structures in the worker section are structured as READ tuples in the worker code. The vectorized tuples are scattered in G groups, where G is calculated according to the following loop scheduling algorithm, developed based on "Factoring - - A Method for Scheduling Parallel Loops," by S.F. Hummel, E. Schonberg and L.E. Flynn, Communications of the ACM, Vol. 35, No. 8, 90-101, August 1992. A related network design is published in "A Virtual Bus Architecture for Dynamic Parallel Processing," by K.C. Lee, IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 2, 121-130, February 1993. K.C. Lee discussed a modular time-space-time network comprising multiple time division networks connected by a nonblocking space division network (switches). Except for the exclusive-read implementation, that design is compatible with the multiple-backbone CLM architecture.
P : total number of processors
N : total number of iterations
f(0<f<=1): the scheduling factor, default 0.5.
T : the termination threshold
Rj : the jth remaining iterations
Sj : jth tuple size (G)
The tuple sizes (the Sj's) are defined as follows:
R0 = N,
Sj = f * Rj / P, and
Rj+1 = Rj - P*Sj = (1-f)*Rj,
until Sj < T.
The value of P can be automatically determined by the CLM backbone management layer and transmitted to the program at execution time. The value T comes from the -T option at the compilation command line. The worker section is padded with statements for retrieving tuples at the beginning of the worker section, and with statements for packing the results into new tuples at the end of the worker section. The two new sliced segments are distinguished by the suffixes M and W, respectively.
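The tuple size calculation can be sketched as follows in C; the function name and the printed schedule are illustrative assumptions, since the actual compiler emits these sizes into the master section's packing code:

#include <stdio.h>

/* Compute the factoring schedule: at each round j, P tuples of size Sj
 * are scattered, with Sj = f * Rj / P, until Sj falls below the threshold T. */
static void factoring_schedule(long n, int p, double f, long t)
{
    long remaining = n;                 /* R0 = N */
    int  round = 0;

    for (;;) {
        long sj = (long)(f * (double)remaining / (double)p);
        if (sj < t || sj == 0) {
            printf("final round %d: %ld leftover iterations\n", round, remaining);
            break;
        }
        printf("round %d: P=%d tuples of size Sj=%ld\n", round, p, sj);
        remaining -= (long)p * sj;      /* Rj+1 = Rj - P*Sj (about (1-f)*Rj) */
        round++;
    }
}

int main(void)
{
    /* Example: N = 10000 iterations, P = 10 processors, f = 0.5, T = 50. */
    factoring_schedule(10000, 10, 0.5, 50);
    return 0;
}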
As shown in the flowchart in FIG. 12, the master section performs the reading and getting 200 of all the
necessary data tuples according to initial data
dependencies; sending 205 globally read-only tuples to the backbone for the workers; calculating 210 the vectorized data tuples according to the factoring algorithm described above; packing 215 data according to Sj's; attaching 220 to each packed tuple (T,Sj,#) where T is the partition
threshold value, Sj is the tuple size and # is the packing sequence number amongst N iterations; looping 225 for i = 1 to G to put 230 the packed data tuples to the
backbone; looping 240 for i = 1 to N to get 245 the results from the backbone and unpack 250 the results; if the fault tolerant option R is greater than zero, then checking time-out 255; otherwise the master terminates. The fault tolerance option includes the steps of letting 260 K be the number of unreceived result tuples, calculated from the difference between the scattered data tuples and the received results; looping 270 for i = 1 to K to re-put 275 data tuples to the backbone; looping 280 for i = 1 to K to get 285 and unpack 290 the result tuples; and looping 255 to continue the fault tolerance option if there are more time-outs. Otherwise, the master section ends 300.
As shown in the flowchart in FIG. 13, the worker section performs the reading 310 of global data tuples;
getting 320 a vectorized tuple; unpacking 325 the received tuple; retrieving 330 (T,Sj,#) from the tuple; computing 335; packing 340 the results, i.e., packing Sj pieces of vectorized data together with (T,Sj,#) into the result tuple; putting 350 the result tuple to the backbone; checking 355 whether Sj>T and, if so, looping back to get a vectorized tuple 320; otherwise ending 360.
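The following single-process C sketch mimics the master/worker interaction over an in-memory stand-in for the backbone, under several simplifying assumptions (the queue, the tuple structure and all function names are illustrative; a real CLM scatters the packed tuples over the slotted ring to independent worker computers):

#include <stdio.h>

#define MAXQ 1024

/* A vectorized data tuple: t is the threshold, sj the tuple size,
 * seq the packing sequence number; 'data' stands in for the packed data. */
struct tuple { long t, sj, seq; double data; };

/* In-memory stand-in for the CLM backbone. */
static struct tuple q[MAXQ];
static int q_head, q_tail;

static void put_tuple(struct tuple tp) { q[q_tail++ % MAXQ] = tp; }
static struct tuple get_tuple(void)    { return q[q_head++ % MAXQ]; }

/* Worker: get a vectorized tuple, compute, put back a result tuple. */
static struct tuple worker(struct tuple in)
{
    struct tuple out = in;
    out.data = in.data * in.data;       /* the "computing 335" step */
    return out;
}

int main(void)
{
    long n = 100, t = 2;                /* N iterations, threshold T */
    int p = 4;                          /* number of CLM processors  */
    double f = 0.5;                     /* scheduling factor         */
    long remaining = n, seq = 0;

    /* Master, scattering part: pack and put tuples per the factoring sizes. */
    while (1) {
        long sj = (long)(f * (double)remaining / (double)p);
        if (sj < t || sj == 0) break;
        for (int i = 0; i < p; i++) {
            struct tuple tp;
            tp.t = t; tp.sj = sj; tp.seq = seq; tp.data = (double)seq;
            seq++;
            put_tuple(tp);
        }
        remaining -= (long)p * sj;
    }

    /* Workers (simulated serially here): consume tuples and produce results. */
    long produced = seq;
    for (long i = 0; i < produced; i++)
        put_tuple(worker(get_tuple()));

    /* Master, gathering part: get and unpack all result tuples. */
    for (long i = 0; i < produced; i++) {
        struct tuple r = get_tuple();
        printf("result #%ld (Sj=%ld): %g\n", r.seq, r.sj, r.data);
    }
    return 0;
}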
d. R-block tuplization by tree slicing: each duplicable R-block is sliced into two sections: a master section and a worker section. The master section contains statements for generating an initial state tuple and a live tuple count TP_CNT, with TP_CNT initially set to 1. The initial state tuple is a collection of all data structures required for subsequent computation. The master section also has statements for collecting the results. As shown in FIG. 14, the master section operates using the procedure of assembling 365 a state tuple from the collection of all read-only dependent data structures in the R-block, with the state tuple
ST_TUPLE = (G, d1, d2, ..., dk), where G is the grain size given at compilation time using the -G option. The master section also creates a result tuple, using the globally updated data structures and return data structures identified via syntactical analysis of the recursion body, and puts the result tuple onto the backbone. The master section generates 370 a live tuple count TP_CNT; sets 375 the live tuple count TP_CNT to 1; puts 380 the state tuple and the live tuple count onto the CLM backbone 120; gets 385 a term tuple from the CLM backbone 120; gets 390 a result tuple from the CLM backbone 120; unpacks 395 the result tuple; and returns 400 the unpacked results.
As shown in FIG. 15, the worker section gets 405 the state tuple ST_TUPLE from the CLM backbone 120; unpacks 410 the state tuple; calculates 415 using the state tuple; and
generates 420 a search tree of G levels. The worker section also updates 425 the result tuple during the calculation via exclusive-reads and writes to the backbone. It then creates 440 N new state tuples according to the results of the calculations. It gets 445 TP_CNT from the backbone and sets it to TP_CNT + N - 1. If TP_CNT becomes 455 zero, then it creates 460 a "term" tuple and puts it into the backbone. Otherwise, it puts 475 N new state tuples into the backbone and loops back to the beginning.
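The live-count bookkeeping that decides when to emit the "term" tuple can be sketched with a small, hypothetical C helper (tuple input/output is abstracted away):

#include <stdio.h>

/* Update the live tuple count after a worker expands one state tuple into
 * N new state tuples: the consumed tuple is retired (-1) and the N children
 * become live (+N).  Returns the new count; *emit_term is set when no live
 * state tuples remain and a "term" tuple must be put onto the backbone so
 * that the master can stop waiting. */
static long update_live_count(long tp_cnt, long n_new, int *emit_term)
{
    long updated = tp_cnt + n_new - 1;
    *emit_term = (updated == 0);
    return updated;
}

int main(void)
{
    int term = 0;
    long cnt = 1;                            /* TP_CNT initially 1          */
    cnt = update_live_count(cnt, 2, &term);  /* root expands into 2 subtrees */
    cnt = update_live_count(cnt, 0, &term);  /* one leaf retires             */
    cnt = update_live_count(cnt, 0, &term);  /* last leaf retires -> term    */
    printf("TP_CNT=%ld, term=%d\n", cnt, term);
    return 0;
}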
e. Perform fault tolerance treatment of the L-blocks, R-blocks, and S-blocks as follows:
a) for each program segment, in addition to its own reads, add data tuple reads for all of its predecessors' READ and
EXCLUSIVE-READ tuples, with an 'R' prefix (the rescue tuples);
b) for each program segment, add data tuple writes, with an 'R' prefix, immediately after reading the input tuples; and
c) add a time-out test code segment to every input tuple read. The time-out code writes the corresponding rescue tuple to the CLM backbone and then returns to the tuple read.
f. EXCLUSIVE-READ deadlock prevention, with the prevention performed by processing the multiple-EXCLUSIVE-READ-blocks. For every block having K exclusive-read input tuples, with K > 1, implement the protocol as shown in FIG. 16, where a count CNT is set 500 to equal 1; input tuples are read 505 from the CLM backbone 120; check 510 if the input tuple is to be read, as opposed to being exclusively read. If the input tuple is to be read, the input tuple is read 515. Otherwise, check 520 if the input tuple is to be exclusively read and if the count CNT is less than K. If the input tuple is to be exclusively read and count CNT is less than K, then the input tuple is read 525 and the count is incremented by setting 530 CNT equal to CNT + 1.
However, if the input tuple is to be exclusively read but the count CNT is greater than or equal to K, the input tuple is exclusively read 535. The procedure illustrated in FIG. 16 prevents possible deadlocks that arise when the K exclusive tuples are acquired by L different program segments on L computers, with L > 1, so that none of the segments can ever assemble a complete set of exclusive tuples.
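A minimal C sketch of this counting protocol, assuming a hypothetical input descriptor that records whether each tuple is marked RD or ERD (the names and the print statements are illustrative only; here K is simply counted from the block's inputs):

#include <stdio.h>

enum mode { RD, ERD };                  /* read vs. exclusive-read marking */

struct input { enum mode m; const char *name; };

/* Process the inputs of one block: only the last ERD input is actually
 * consumed exclusively; earlier ERD inputs are downgraded to plain reads,
 * so two blocks can never each hold part of the other's exclusive set. */
static void read_inputs(struct input *in, int n)
{
    int k = 0;                          /* K = number of exclusive-read inputs */
    for (int i = 0; i < n; i++)
        if (in[i].m == ERD) k++;

    int cnt = 1;                        /* CNT is set to 1 (step 500) */
    for (int i = 0; i < n; i++) {
        if (in[i].m == RD) {
            printf("READ %s\n", in[i].name);               /* step 515 */
        } else if (cnt < k) {
            printf("READ %s (downgraded)\n", in[i].name);  /* step 525 */
            cnt = cnt + 1;                                 /* step 530 */
        } else {
            printf("EXCLUSIVE-READ %s\n", in[i].name);     /* step 535 */
        }
    }
}

int main(void)
{
    struct input inputs[] = { { ERD, "X_1" }, { RD, "A" }, { ERD, "X_2" } };
    read_inputs(inputs, 3);             /* only X_2 is exclusively read */
    return 0;
}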
g. Map generation: For each program segment, generate the initial input tuple names and mark the initial input tuple names with RD to be read, or ERD to be
exclusively read. For sliced L and R segments, the initial input tuples in the map should NOT contain the tuples belonging to the gathering parts, such as 245 and 285 in FIG. 12 and 385 and 390 in FIG. 14. These tuples are to be enabled after the completion of the scattering part. Store the information in a file having a name in the form of
APPLICATION NAME + MAP.
CLM Operating System Extension
With each processor running a multi-programmed operating system kernel, for example UNIX™, VMS™ or another operating system, the CLM O/S extension 560, as shown in FIG. 17, contains the following elements:
1. Event Scheduler 585, being a program using CLM extended TCP/IP to interpret the network traffic. The Event Scheduler 585 branches into four different servers: a
Program Management Server (PMS) 610, a Data Token Server (DTS) 615, a General Process Management Server (GPMS) 620, and a Conventional Operating System Interface (COSI) 625 upon receipt of every network message. Note that network messages fall into the following groups:
a) Program execution command (point-to-point and
broadcast): the RUN command.
b) Data token traffic.
c) General process management commands: KILL,
SUSPEND, RESUME, STATUS and others.
d) Conventional operating system commands (point-to- point and broadcast), such as remote file access, terminal display, etc.
2. Program Management Server (PMS) 610, for storing, activating, and removing partitioned CLM program segments. The program storage and removal functions act as a simple interface with existing O/S file systems. After receiving an activation command or a RUN command for an application program, the PMS builds a memory image for every related segment:
Trigger Tuple Name Table, (extracted from
DTPS_TBL, 127 entries);
Disk address of the segment, (1 entry);
Data Area*(v); and
Instruction Area*(^).
If the local memory is limited, the segment image can contain only the Trigger Tuple Name Table and the Disk Address entry. Similar to a demand paging concept, a segment with a matching tuple is then fetched from local disk to local memory. The trigger tuple name table size is
adjustable at system generation time. The PMS also creates an indexed table (MTBL) from the DTPS_TBL to the newly created segment images according to the referenced tuple names.
3. Data Token Server (DTS) 615, for matching the received data tuple, as a data token, with the DTPS_TBL. Whenever a backbone match is made, it marks all related program segments' trigger tuple name table entries using MTBL. If a complete trigger is found for any program segment, computer control is handed to the respective program segment.
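The searchless matching idea can be sketched as follows in C, assuming hypothetical fixed-size tables and segment images (the structure layouts and the activation printout are illustrative; the disclosure specifies only that the compiler-generated tuple index selects segment entries through the MTBL without searching):

#include <stdbool.h>
#include <stdio.h>

#define MAX_SEG 4
#define MAX_TUP 16

/* Segment image: which tuple indices trigger it and how many are still missing. */
struct segment {
    const char *name;
    bool needs[MAX_TUP];
    int  missing;
};

static struct segment seg[MAX_SEG] = {
    { "appX_S1", { [0] = true, [1] = true }, 2 },   /* triggered by tuples 0 and 1 */
    { "appX_L1", { [1] = true },             1 },   /* triggered by tuple 1        */
};

/* MTBL: tuple index -> segments referencing it (built from DTPS_TBL at load time). */
static const int mtbl[MAX_TUP][MAX_SEG] = { [0] = { 0 }, [1] = { 0, 1 } };
static const int mtbl_len[MAX_TUP]      = { [0] = 1,     [1] = 2 };

/* Data Token Server: on tuple arrival, mark the trigger entries of every
 * referencing segment; a segment whose trigger set is complete is activated. */
static void dts_on_tuple(int tup)
{
    for (int i = 0; i < mtbl_len[tup]; i++) {
        struct segment *s = &seg[mtbl[tup][i]];
        if (s->needs[tup]) {
            s->needs[tup] = false;
            if (--s->missing == 0)
                printf("activate %s\n", s->name);   /* hand control to the segment */
        }
    }
}

int main(void)
{
    dts_on_tuple(1);    /* activates appX_L1 */
    dts_on_tuple(0);    /* completes appX_S1's trigger set */
    return 0;
}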
4. General Process Management Server (GPMS) 620, for managing the KILL, SUSPEND, and RESUME commands, as well as general process status (GPS) commands for the CLM processes.
Since there is only one segment executing at any time on the local processor, this is a simple interface with the conventional operating system using the information from DTPS_TBL.
5. Conventional Operating System Interface (COSI) 625, for intercepting all commands of the conventional operating system 585 and relaying the conventional operating system commands to the running kernel.
The present invention uses a method for executing an application program including a plurality of program
segments, with each program segment labeled according to data dependencies and parallelism among the program
segments, and with each program segment connected to other program segments by corresponding labeled data tuples. As shown in FIG. 18, the method includes the steps of receiving 1040 an execution command; transmitting 1045, on the CLM backbone 120, the labeled program segments to the plurality of computers; loading 1050 the labeled program segments into each computer of the plurality of computers; transmitting 1055, on the CLM backbone 120, the labeled data tuples to the plurality of computers; receiving 1060, at a receiving computer, a labeled data tuple; activating 1065, at the receiving computer, in response to the labeled data tuple, the program segments corresponding to the labeled data tuple; processing 1070, at the receiving computer, the activated program segments; transmitting 1075, on the CLM backbone 120, the results of the processing of the program segments as labeled data tuples; continuing 1080 to
transmit, on the CLM backbone 120, the labeled data tuples to the plurality of computers; and continuing 1085 to process, at the plurality of computers, the program segments corresponding to the labeled data tuples until the entire application program is processed.
CLM IMPLEMENTATION
This section exhibits a conceptual implementation of CLM and its possible variations.
FIG. 19 illustrates CLM Processor and Single Backbone
Interface. There are three independent processes: Read, Write and CPU. The shifting register holds the rotating messages. The message header indicates the nature of the message. There can be five types of messages:
0: empty slot
1: data tuple
2: data tuple reset signal
3: program management
4: conventional operating system commands
The S bit indicates the availability of the register to the
WRITE process. It may be implemented either on the local circuit or as part of the message. The Shift_Clock controls the backbone rotating frequency. The shifting register can typically hold 1024 bytes of information. The protocols of the three processes are as follows:
Read:
If Global_clock = on    /* Read/Write period */
[The remainder of the Read protocol, together with the Write and CPU protocols, appears only as figure images (imgf000033_0001, imgf000034_0001) in the published text.]
Note that the Read process handles the following special cases:
a) Purge of returned messages. This protocol checks the register content against BUFFER_OUT. If there is a match, and if the message is not an EXCLUSIVE-READ message or is not the last packet of an EXCLUSIVE-READ message, then the content in BUFFER_OUT is purged. This empties the message slots. A returned EXCLUSIVE-READ message (or the last packet of the message) keeps circulating until it is consumed.
b) Tuple name matching. A data tuple in the Register contains an index generated by the compiler. This index is also recorded in the DTPS_TBL and the local TTNTs. A simple test using the index against the local TTNTs can determine the existence of a match. All read tuples are copied to BUFFER_IN.
c) EXCLUSIVE-READ deadlock avoidance. When more than one CPU exclusively reads one of many tuples belonging to the same program segment, or more than one CPU exclusively reads some of the packets belonging to the same tuple, none of the CPUs will ever be completely matched. The deadlock is avoided by exclusively reading only the last EXCLUSIVE-READ tuple, or the last packet of an EXCLUSIVE-READ tuple, in a TTNT.
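A condensed C sketch of these three special cases, under heavy simplifying assumptions (the message structure, the single-register treatment and all helper names are illustrative; the actual Read, Write and CPU protocols appear only as figures in the source):

#include <stdbool.h>
#include <string.h>

enum header { EMPTY = 0, DATA_TUPLE = 1, TUPLE_RESET = 2, PROG_MGMT = 3, OS_CMD = 4 };

struct msg {
    enum header hdr;
    int  index;          /* compiler-generated tuple index                    */
    bool exclusive;      /* EXCLUSIVE-READ message                            */
    bool last_packet;    /* last packet of a (possibly multi-packet) tuple    */
};

/* Simplified per-node state. */
static struct msg buffer_out;                  /* message this node last wrote */
static struct msg buffer_in[64]; static int n_in;
static bool ttnt[128];                         /* trigger tuple name table hits */

/* Handle one register slot during the Read period (a condensed sketch). */
static void read_process(struct msg *reg)
{
    if (reg->hdr != DATA_TUPLE)
        return;

    /* a) Purge of returned messages. */
    if (reg->index == buffer_out.index && buffer_out.hdr == DATA_TUPLE) {
        if (!(reg->exclusive && reg->last_packet)) {
            memset(&buffer_out, 0, sizeof buffer_out);
            reg->hdr = EMPTY;                  /* empty the message slot */
        }
        return;   /* a returned EXCLUSIVE-READ last packet keeps circulating */
    }

    /* b) Tuple name matching via the compiler-generated index. */
    if (ttnt[reg->index % 128]) {
        buffer_in[n_in++] = *reg;              /* copy matched tuple to BUFFER_IN */

        /* c) Deadlock avoidance: consume (remove from the ring) only the last
         *    EXCLUSIVE-READ tuple/packet needed locally; approximated here by
         *    the last_packet flag alone.                                       */
        if (reg->exclusive && reg->last_packet)
            reg->hdr = EMPTY;
    }
}

int main(void)
{
    ttnt[5] = true;
    struct msg m = { DATA_TUPLE, 5, true, true };
    read_process(&m);
    return n_in == 1 ? 0 : 1;
}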
FIG. 20 illustrates CLM with three Processors and A Single Backbone. In this structure, point-to-point,
broadcast and exclusive-read of any message can be performed by any processor. When the BUFFER_OUTs on all processors are full, the system enters a "lock state". The lock state may be automatically unlocked when some of the CPUs become available for computing. The system enters a "deadlock state" when all CPUs are blocked for output and no empty slot is available on the backbone.
There are two solutions to this condition:
a) Expand the BUFFER_OUT sizes. This can postpone the occurrence of the lock state; and
b) Increase the CPU powers or the number of CPUs. This can be used to make deadlock-free CLMs. When the CPUs are idle most of the time, the backbone is the bottleneck.
FIG. 21 illustrates the CPU-to-backbone interface for a CLM system with two backbones. Both READ and WRITE
processes must dispatch the messages to and from the multiple backbones. The size and number of buffers should be
adjusted to accommodate the increased bandwidth. The dispatch algorithm and the interface hardware implementation must assure that the malfunction of any subset of backbones will not bring down the system [R3].
The backbone initialization command sets all message (register) headers to 0.
The number of backbones affects the cost, total
performance and fault tolerance degree. The final design is application dependent.
Under this simple design, processors are not guaranteed to share the backbone in a "fair" fashion - - the closer neighbors of the sender will be busier than those farther away. In general, this should not affect the overall CLM performance since, when all the closer neighbors are busy, the farther neighbors will eventually receive work.
A restriction of this simple design is that the
backbone is strictly used for data communication. It cannot be used for high quality point-to-point multi-media signal transmission since the simple protocols do not maintain virtual channels. The use of the HEFLAN protocol, U.S.
Patent Filing Number: 08/029,882, A MEDIUM ACCESS CONTROL PROTOCOL FOR SINGLE-BUS MULTIMEDIA FAIR ACCESS LOCAL AREA NETWORKS, by Zheng Liu, can give both the fairness property and the multi-media capabilities.
The use of existing telecommunication technologies can implement CLMs using local and long haul networks. The shifting registers can be replaced by any existing
networking system that has a ring-compatible topology, e.g., Token Ring (IEEE 802.5), DQDB (IEEE 802.6), Star-Connected ATM LAN (Fore Systems) and HEFLAN, provided that they are modified to implement the "EXCLUSIVE-READ" function (similar to the Read protocol above).
The present invention also demonstrates a feasible design of a single backbone and a multiple-backbone CLM system. The disclosed protocols illustrate the principles for implementing:
a) Point-to-point, broadcast and EXCLUSIVE-READ messaging;
b) Message slot recycling;
c) Local buffer recycling;
d) EXCLUSIVE-READ deadlock prevention; and
e) CLM program management and local operating system command processing.
These principles must be used in conjunction with other computer engineering and data communication
techniques for the construction of practical CLM systems.
The present invention automatically partitions any sequential program into program segments and uses a method for executing an application program including a plurality of inter-relating program segments, with each program segment labeled according to data dependencies and
parallelism among the program segments, and with each program segment connected to other program segments by a corresponding labeled data tuple. As shown in FIG. 18, the method includes the steps of receiving 1040 an execution command; transmitting 1045, on the CLM backbone 120, the labeled program segments to the plurality of computers;
loading 1050 the labeled program segments into each computer of the plurality of computers; transmitting 1055, on the CLM backbone 120, the labeled data tuples to the plurality of computers; receiving 1060, at a receiving computer, a labeled data tuple; activating 1065, at the receiving computer, in response to the labeled data tuple, the program segments corresponding to the labeled data tuple; processing 1070, at the receiving computer, the activated program segments; transmitting 1075, on the CLM backbone 120, the results of the processing of the program segments as labeled data tuples; continuing 1080 to transmit, on the CLM
backbone 120, the labeled data tuples to the plurality of computers; and continuing 1085 to process, at the plurality of computers, the program segments corresponding to the labeled data tuples until the entire application program is processed.
It will be apparent to those skilled in the art that various modifications may be made to the multicomputer system and method of the instant invention without departing from the scope or spirit of the invention, and it is intended that the present invention cover modifications and variations of the multicomputer system and method provided they come within the scope of the appended claims and their equivalents.

Claims

I CLAIM:
1. A connectionless machine (CLM) multicomputer system comprising:
a unidirectional slotted ring;
a plurality of computers, each computer connected to said unidirectional slotted ring, each computer having at least 0.5 MFLOPS processing power, each computer having a local real memory and an operating system supporting a virtual memory and multiprogramming;
wherein at least one of said plurality of computers, responsive to receiving from said unidirectional slotted ring, an initial data tuple of a program segment, transmits onto said unidirectional slotted ring an
EXCLUSIVE-READ signal for informing said plurality of computers of reception of the initial data tuple of the program segment; and
wherein said plurality of computers, responsive to receiving the EXCLUSIVE-READ signal, ceases contending for data tuples corresponding to the program segment.
2. The CLM multicomputer system as set forth in claim
1 wherein at least one computer includes means for
decomposing a sequentially composed application program into a plurality of program segments, thereby forming an acyclic graph.
3. The CLM multicomputer system as set forth in claim
2 wherein each computer including decomposing means further includes: means for analyzing each program segment of the decomposed application program; and
means, responsive to analysis of the decomposed program segment for converting each decomposed program segment for single-instruction-multiple-data (SIMD)
processing.
4. The CLM multicomputer system as set forth in claim
3 wherein each computer including decomposing means further includes:
a single-language to sequential-programming language extension;
a parallel compiler, responsive to force copy from the single-language to sequential-programming language extension, for skipping dependency analysis and for
vectorizing designated repetitive program segments.
5. The CLM multicomputer system as set forth in claim
4 wherein each computer including decomposing means further includes means for converting each decomposed program segment into a labeled tuple-driven format.
6. The CLM machine as set forth in claim 5 wherein each computer including the decomposing means further includes means for duplicating tuple-driven segments onto each connected computer for allowing automatic formation of coarse grain SIMD, multiple-instruction-multiple-data and pipelined processors at run time.
7. The CLM machine as set forth in claim 6 wherein each computer including decomposing means further includes means for automatically detecting faults and means for recovering using fault tolerant processing.
8. The CLM multicomputer system as set forth in claim
7 wherein each computer including decomposing means further includes means for automatically scheduling heterogeneous computers for every SIMD segment by using a modified
factoring scheduling algorithm.
9. The CLM multicomputer system as set forth in claim
8 wherein each computer including decomposing means further includes means for automatically balancing the load of heterogeneous computers for every recursive segment using a uniformly controlled depth value G.
10. The CLM multicomputer system as set forth in claim
9 further including means for resolving multiple EXCLUSIVE-READ deadlock in use.
11. The CLM multicomputer system as set forth in claim
10 further including means for assigning tuple names and decomposed segment names for multiple decomposed programs running in parallel using multiple computers.
12. The CLM multicomputer system as set forth in claim
11 wherein each computer having decomposing means includes an operating system extension design for carrying out ordinary multiple operating systems activities and
connectionless parallel processing by forming conventional operating system interfaces, a connectionless data token server, a connectionless-program initiator, and a connectionless process manager.
13. The CLM multicomputer system as set forth in claim
12 wherein each computer including a decomposing means further includes means for calculating a theoretical
performance of a connectionless parallel multicomputer.
14. The CLM multicomputer system as set forth in claim
13 wherein each computer including a decomposing means includes a protocol for the READ process for each computer for implementing the connectionless multicomputer system.
15. The CLM multicomputer system as set forth in claim
14 wherein each computer having decomposing means further includes a protocol for the WRITE process for each computer for implementing the connectionless multicomputer system.
16. The CLM multicomputer system as set forth in claim
15 wherein each computer including decomposing means further includes a protocol for the CPU process for each computer for implementing the connectionless multi-computer system.
17. A method using a connectionless machine (CLM) multicomputer system using a plurality of computers,
comprising the steps of:
sending a plurality of data tuples along a
unidirectional slotted ring;
transmitting, from at least one of said plurality of computers, in response to receiving from said
unidirectional slotted ring an initial data tuple of a program segment, onto said unidirectional slotted ring an EXCLUSIVE-READ signal for informing said plurality of computers of reception of the initial data tuple of the program segment; and
ceasing by said plurality of computers, in
response to receiving the EXCLUSIVE-READ signal, contending for data tuples corresponding to the program segment.
18. The method using the CLM multicomputer system as set forth in claim 17 wherein at least one computer includes the step of decomposing a sequentially-composed application program into a plurality of program segments, thereby forming an acyclic graph.
19. The method using the CLM multicomputer system as set forth in claim 18 wherein each computer performing the decomposing step further includes the steps of:
analyzing each program segment of the decomposed application program; and
converting, in response to analysis of the decomposed program segment, each decomposed program segment for single-instruction-multiple-data (SIMD) processing.
20. The method using the CLM multicomputer system as set forth in claim 19 wherein each computer performing the decomposing step further includes the steps of skipping, in response to force copy from the single-language to
sequential-programming language extension, dependency analysis, and vectorizing designated repetitive program segments.
21. The method using the CLM multicomputer system as set forth in claim 20 wherein each computer performing the decomposing step further includes the step of converting each decomposed program segment into a labeled tuple-driven format.
22. The method using the CLM machine as set forth in claim 21 wherein each computer performing the decomposing step further includes the step of duplicating tuple-driven segments onto each connected computer for allowing automatic formation of coarse grain SIMD, multiple-instruction-multiple-data and pipelined processors at run time.
23. The method using the CLM machine as set forth in claim 22 wherein each computer performing the decomposing step further includes the steps of automatically detecting faults and recovering using fault tolerant processing.
24. The method using the CLM multicomputer system as set forth in claim 23 wherein each computer performing the decomposing step further includes the step of automatically scheduling heterogeneous computers for every SIMD segment by using a modified factoring scheduling algorithm.
25. The method using the CLM multicomputer system as set forth in claim 24 wherein each computer performing the decomposing step further includes the step of automatically balancing the load of heterogeneous computers for every recursive segment using a uniformly controlled depth value G.
26. The method using the CLM multicomputer system as set forth in claim 25 further including the step of
resolving multiple EXCLUSIVE-READ deadlock in use.
27. The method using the CLM multicomputer system as set forth in claim 26 further including the step of
assigning tuple names and decomposed segment names for multiple decomposed programs running in parallel using multiple computers.
28. The method using the CLM multicomputer system as set forth in claim 27 wherein each computer performing the decomposing step includes the step of carrying out ordinary multiple operating systems activities and connectionless parallel processing by forming conventional operating system interfaces, a connectionless data token server, a
connectionless-program initiator, a connectionless process manager.
29. The method using the CLM multicomputer system as set forth in claim 28 wherein each computer performing the decomposing step further includes the step of calculating a theoretical performance of a connectionless parallel
multicomputer.
30. The method using the CLM multicomputer system as set forth in claim 29 wherein each computer performing the decomposing step includes the step of implementing the connectionless multicomputer system using a protocol for the READ process for each computer.
31. The method using the CLM multicomputer system as set forth in claim 30 wherein each computer performing the decomposing step further includes the step of implementing the connectionless multicomputer system using a protocol for the WRITE process for each computer.
32. The method using the CLM multicomputer system as set forth in claim 31 wherein each computer performing the decomposing step further includes the step of implementing the connectionless multi-computer system using a protocol for the CPU process for each computer.
33. A method using a connectionless machine (CLM) multicomputer system using a plurality of computers,
comprising the steps of:
sending a plurality of data tuples along a unidirectional slotted ring;
transmitting, from at least one of said plurality of computers, in response to receiving from said
unidirectional slotted ring an initial data tuple of a subset of a partitioned program, onto said unidirectional slotted ring an EXCLUSIVE-READ signal for informing said plurality of computers of reception of the initial data tuple of the subset of the partitioned program; and
ceasing by said plurality of computers, in response to receiving the EXCLUSIVE-READ signal, contending for data tuples corresponding to the subset of the
partitioned program.
PCT/US1994/012921 1994-11-07 1994-11-07 Multicomputer system and method WO1996014617A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/US1994/012921 WO1996014617A1 (en) 1994-11-07 1994-11-07 Multicomputer system and method
JP8515266A JPH10508714A (en) 1994-11-07 1994-11-07 Multicomputer system and method
AU11746/95A AU1174695A (en) 1994-11-07 1994-11-07 Multicomputer system and method
EP95902495A EP0791194A4 (en) 1994-11-07 1994-11-07 Multicomputer system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US1994/012921 WO1996014617A1 (en) 1994-11-07 1994-11-07 Multicomputer system and method

Publications (1)

Publication Number Publication Date
WO1996014617A1 true WO1996014617A1 (en) 1996-05-17

Family

ID=22243254

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1994/012921 WO1996014617A1 (en) 1994-11-07 1994-11-07 Multicomputer system and method

Country Status (4)

Country Link
EP (1) EP0791194A4 (en)
JP (1) JPH10508714A (en)
AU (1) AU1174695A (en)
WO (1) WO1996014617A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001014970A2 (en) * 1999-08-25 2001-03-01 Infineon Technologies Ag Event scheduler and method for analyzing an event-oriented program code
GB2374443B (en) * 2001-02-14 2005-06-08 Clearspeed Technology Ltd Data processing architectures
CN102567079A (en) * 2011-12-29 2012-07-11 中国人民解放军国防科学技术大学 Parallel program energy consumption simulation estimating method based on progressive trace update
EP3539261A4 (en) * 2016-11-14 2020-10-21 Temple University Of The Commonwealth System Of Higher Education System and method for network-scale reliable parallel computing
CN113590166A (en) * 2021-08-02 2021-11-02 腾讯数码(深圳)有限公司 Application program updating method and device and computer readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE508415T1 (en) * 2007-03-06 2011-05-15 Nec Corp DATA TRANSFER NETWORK AND CONTROL DEVICE FOR A SYSTEM HAVING AN ARRAY OF PROCESSING ELEMENTS EACH EITHER SELF-CONTROLLED OR JOINTLY CONTROLLED

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5021947A (en) * 1986-03-31 1991-06-04 Hughes Aircraft Company Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing
US5313647A (en) * 1991-09-20 1994-05-17 Kendall Square Research Corporation Digital data processor with improved checkpointing and forking

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5021947A (en) * 1986-03-31 1991-06-04 Hughes Aircraft Company Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing
US5313647A (en) * 1991-09-20 1994-05-17 Kendall Square Research Corporation Digital data processor with improved checkpointing and forking

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP0791194A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001014970A2 (en) * 1999-08-25 2001-03-01 Infineon Technologies Ag Event scheduler and method for analyzing an event-oriented program code
WO2001014970A3 (en) * 1999-08-25 2002-08-01 Infineon Technologies Ag Event scheduler and method for analyzing an event-oriented program code
GB2374443B (en) * 2001-02-14 2005-06-08 Clearspeed Technology Ltd Data processing architectures
US8127112B2 (en) 2001-02-14 2012-02-28 Rambus Inc. SIMD array operable to process different respective packet protocols simultaneously while executing a single common instruction stream
CN102567079A (en) * 2011-12-29 2012-07-11 中国人民解放军国防科学技术大学 Parallel program energy consumption simulation estimating method based on progressive trace update
CN102567079B (en) * 2011-12-29 2014-07-16 中国人民解放军国防科学技术大学 Parallel program energy consumption simulation estimating method based on progressive trace update
EP3539261A4 (en) * 2016-11-14 2020-10-21 Temple University Of The Commonwealth System Of Higher Education System and method for network-scale reliable parallel computing
US11588926B2 (en) 2016-11-14 2023-02-21 Temple University—Of the Commonwealth System of Higher Education Statistic multiplexed computing system for network-scale reliable high-performance services
CN113590166A (en) * 2021-08-02 2021-11-02 腾讯数码(深圳)有限公司 Application program updating method and device and computer readable storage medium
CN113590166B (en) * 2021-08-02 2024-03-26 腾讯数码(深圳)有限公司 Application program updating method and device and computer readable storage medium

Also Published As

Publication number Publication date
EP0791194A1 (en) 1997-08-27
EP0791194A4 (en) 1998-12-16
JPH10508714A (en) 1998-08-25
AU1174695A (en) 1996-05-31

Similar Documents

Publication Publication Date Title
US5517656A (en) Multicomputer system and method
US11128519B2 (en) Cluster computing
US5021947A (en) Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing
Kruskal et al. Efficient synchronization of multiprocessors with shared memory
JP2882475B2 (en) Thread execution method
Willcock et al. AM++ a generalized active message framework
Hoefler et al. Towards efficient mapreduce using mpi
Grimshaw et al. Portable run-time support for dynamic object-oriented parallel processing
US8595736B2 (en) Parsing an application to find serial and parallel data segments to minimize mitigation overhead between serial and parallel compute nodes
EP1171829A1 (en) Distributed digital rule processor for single system image on a clustered network and method
JPH08185325A (en) Code generation method in compiler and compiler
EP0791194A1 (en) Multicomputer system and method
EP0420142B1 (en) Parallel processing system
Gaudiot et al. Performance evaluation of a simulated data-flow computer with low-resolution actors
Wrench A distributed and-or parallel prolog network
Shekhar et al. Linda sub system on transputers
Moreira et al. Autoscheduling in a shared memory multiprocessor
Stricker et al. Decoupling communication services for compiled parallel programs
Solworth Epochs
Buzzard High performance communications for hypercube multiprocessors
Arapov et al. Managing the computing space in the mpC compiler
JP3514578B2 (en) Communication optimization method
Barak et al. The MPE toolkit for supporting distributed applications
Athas et al. Multicomputers
Painter et al. ACLMPL: Portable and efficient message passing for MPPs

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BB BG BR CA FI HU JP KP KR LK MG MN MW NO PL RO SD SE

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase

Ref document number: 2204518

Country of ref document: CA

Ref country code: CA

Ref document number: 2204518

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 1995902495

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1995902495

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1995902495

Country of ref document: EP