WO2011105408A1

WO2011105408A1 - SIMD processor

Info

Publication number: WO2011105408A1
Application number: PCT/JP2011/053935
Authority: WO
Inventors: 昭倫京
Original assignee: 日本電気株式会社
Priority date: 2010-02-24
Filing date: 2011-02-23
Publication date: 2011-09-01
Also published as: JP5708634B2; JPWO2011105408A1

Abstract

The disclosed SIMD processor enables a program to be processed at high speed using processing elements of the SIMD processor without modifying data arrangement upon global memory. The SIMD processor is provided with a control processor (CP) and a plurality of processing elements (PEs), wherein the plurality of PEs perform an SIMD operation for executing a single instruction issued from the CP, a specific PE among the plurality of PEs performs an instruction/data distribution operation for receiving the instruction and data issued from the CP, and for the instruction that has been broadcast to each PE from the CP in the instruction/data distribution operation, each of the PEs performs a systolic operation for execution after source operands of the instruction have been collected.

Description

SIMD processor

(Description of related applications)
The present invention is based upon and claims the benefit of priority from Japanese Patent Application No. 2010-038976 (filed on Feb. 24, 2010), the entire description of which is incorporated herein by reference.

The present invention relates to a SIMD (Single Instruction Multiple Data) processor.

FIG. 9 is a diagram showing the configuration of the SIMD processor described in Non-Patent Document 1. Referring to FIG. 9, the SIMD processor includes a plurality of computing elements (PE: Processing Element) 110 and a control processor (CP: Control Processor) 130 that issues the same instruction to the plurality of PEs 110. According to the SIMD processor, high computing performance can be realized based on inexpensive hardware.

In the SIMD processor, under the control of the CP 130, the data arranged at a predetermined position on the global memory (Global Memory) 140 managed by the CP 130 is sequentially read into the local memory (Local Memory) 120 on the PE 110 side in the order of addresses. Thereafter, all the PEs 110 perform calculations on the data in their local memory 120 at the same time in accordance with instructions issued from the CP 130.

Note that Patent Document 1 describes an image processor that can switch between a SIMD type and a systolic array type configuration.

Patent Document 1: JP 2008-034953 A

To speed up a program through parallel processing by porting it to a SIMD processor, it is generally necessary to change the data arrangement on the global memory 140 so that the data on the global memory 140 can be conveniently read into the PE array. However, changing the data arrangement on the global memory 140 may require redesigning the algorithm, which places a heavy burden on the program developer. This is a factor that hinders the use of SIMD processors.

To ease the program developer's work of changing the data arrangement on the global memory 140, one conceivable method is, instead of giving each PE 110 its own local memory block as in the SIMD processor shown in FIG. 9, to connect a large number of memory blocks and a large number of PEs with abundant wiring (for example, crossbars), as in a dynamically reconfigurable processor. However, because this method requires adding a large number of wiring circuits to the SIMD processor, the hardware for realizing the SIMD processor becomes expensive.

In summary, porting a program to a SIMD processor based on an inexpensive hardware configuration requires changing the data arrangement on the global memory 140. Because changing the data arrangement on the global memory 140 requires changing the algorithm, it places a heavy burden on the program developer and hinders the use of the SIMD processor.

Therefore, the problem to be solved is to make it possible to process a program at high speed by utilizing the computing elements of the SIMD processor without changing the data arrangement on the global memory. An object of the present invention is to provide a SIMD processor that solves this problem.

The SIMD processor according to the first aspect of the present invention is:
A SIMD (Single Instruction Multiple Data) processor comprising a control processor (CP: Control Processor) and a plurality of computing elements (PE: Processing Element),
The plurality of PEs perform a SIMD operation to execute a single instruction issued from the CP,
The CP performs an instruction / data distribution operation for distributing different instructions, or different instructions and data, to each of the plurality of PEs,
Each of the plurality of PEs performs a systolic operation of executing an instruction sent from the CP in the instruction / data distribution operation after the source operands of the instruction are prepared.

According to the SIMD processor according to the present invention, it is possible to process a program at high speed by utilizing the arithmetic elements of the SIMD processor without changing the data arrangement on the global memory.

The drawings show, in order: the configuration of one PE in the PE array included in the SIMD processor according to the embodiment; the configuration of the control processor (CP) included in the SIMD processor according to the embodiment; the instruction format in the SIMD processor according to the embodiment; an example of the configuration of the SIMD processor according to the embodiment; data transfer between PEs in the SIMD processor; a diagram for explaining an example; a further diagram for explaining an example; pseudo code in an example; an example of the configuration of a conventional SIMD processor; and the configuration of one PE in the PE array included in the conventional SIMD processor.

According to a first development mode, the SIMD processor according to the first aspect is provided.

According to a second development mode, a SIMD processor is provided in which the CP sequentially and exclusively uses the instruction issue paths to all PEs in order to distribute instructions to each PE in the instruction / data distribution operation.

According to a third development mode, a SIMD processor is provided in which, in the systolic operation, the plurality of PEs transfer the results of executed operation instructions to other PEs, and use the operation results transferred from other PEs as source operands of the instructions sent in the instruction / data distribution operation.

According to a fourth development mode, a SIMD processor is provided in which, in the systolic operation, each of the plurality of PEs uses data sent from the CP as a source operand of the instruction sent in the instruction / data distribution operation.

According to a fifth development mode, a SIMD processor is provided that further includes a global access arbitration unit (Global Access Arbiter) that, in the systolic operation, arbitrates global memory accesses by the plurality of PEs and guarantees exclusive access to the global memory.

According to a sixth development mode, a SIMD processor is provided in which each of the plurality of PEs includes a register for storing a flag for keeping the operation stopped during the instruction / data distribution operation.

According to a seventh development mode, a SIMD processor is provided in which each of the plurality of PEs includes a register that stores a flag for switching the operation between the SIMD operation and the systolic operation.

According to an eighth development mode, a SIMD processor is provided in which each of the plurality of PEs includes a selector that selects whether or not to store an instruction issued from the CP in its own instruction buffer.

The SIMD processor according to the present invention performs three operations. In the SIMD operation, the entire PE array executes a single instruction issued from the control processor (CP). In the instruction / data distribution operation, the CP uses the instruction issue path to transmit different instruction codes and data to each PE in sequence. In the systolic operation, each PE in the PE array takes as its instruction not the instruction broadcast from the CP every cycle but the instruction transmitted from the CP in the instruction / data distribution operation, executes it when its source operands are ready, and writes the execution result to the register resource of another designated PE. During the systolic operation, each PE starts executing its instruction after the operands of the instruction have been written by data sent from other PEs or from the CP. In the case of an operation instruction, the result is sent to another PE; in the case of a memory access instruction, the global memory is accessed.

Next, effects brought about by the SIMD processor according to the present invention will be described.

In the conventional SIMD processor, since all PEs execute the same instruction, memory access instructions are generated simultaneously in all PEs. Therefore, in order to extract the performance of the SIMD processor, it is necessary to limit the memory space accessible to each PE to a space local to each PE.

However, according to the SIMD processor of the present invention, different instructions can be issued to each PE by the instruction / data distribution operation. Accordingly, it is possible to stagger the timing at which the PEs execute memory access instructions, or to allow only specific PEs to execute memory access instructions. As a result, even if the data arrangement on the memory is left as it is and the memory space accessible by the PEs is expanded to the global memory space, the frequency with which many PEs compete for the global memory space, which is a single hardware resource, can be reduced, and the effective performance of the processor's computing resources can be improved.

Therefore, according to the SIMD processor according to the present invention, it is possible to speed up the processing using the operation resource of the SIMD processor without changing the data arrangement on the memory of the program (first effect).

In addition, when the processing for one iteration is expressed as a data flow graph, the CP can, by performing the instruction / data distribution operation on the PE array, distribute to each PE the operation instruction and related constant data corresponding to each operation node, thereby assigning each operation node in the data flow graph to a PE. Further, by repeatedly issuing to one or more PEs the activation data that starts execution in the systolic operation mode on the PE array, the CP can launch different iterations one after another and have the PE array process them.

Therefore, according to the SIMD processor according to the present invention, processing of the portion of a program's loops that has no data dependency between iterations (the parallel loop portion) can be sped up by executing it in a pipelined manner on the PE array, without changing the data arrangement on the global memory (second effect).

To realize the above instruction / data distribution operation, the circuit resources of the instruction issue path to all PEs and the data wiring circuit resources from each PE to the CP and from the CP to each PE, both of which are provided in the conventional SIMD processor, can be used as they are. Further, as the connection for accessing the global memory space from each PE, the wiring circuit resources for exchanging scalar data between the CP and the PE array provided in the conventional SIMD processor can be used as they are. Furthermore, as the connection between PEs whose assigned operation nodes on the data flow graph exchange data with each other, the wiring resources between adjacent PEs provided in the conventional SIMD processor can be used as they are.

Therefore, according to the SIMD processor of the present invention, the first effect and the second effect can be obtained (third effect) only by adding a small amount of circuit to the conventional SIMD processor.

(Embodiment)
Next, the SIMD processor according to the embodiment will be described with reference to the drawings.

Before describing the SIMD processor of this embodiment, the configuration of a conventional SIMD processor is shown for comparison. FIG. 10 is a diagram showing a configuration of one PE 110 in a PE array included in a conventional SIMD processor.

Referring to FIG. 10, the PE 110 includes an instruction buffer (instb) 111 for storing instructions issued from the CP 130 (FIG. 9), general purpose registers (General Purpose Registers) r0 to r7, an arithmetic unit (ALU: Arithmetic Logic Unit) 112, an entrance / exit to the connection network between PEs (Left / Right Inter PE Connection), and a local memory (Local Memory) 120 for each PE. All the PEs 110 simultaneously execute a single instruction issued from the CP 130 every cycle.

FIG. 1 shows the configuration of one PE 10 in the PE array included in the SIMD processor according to this embodiment. Referring to FIG. 1, the PE 10 includes an instruction buffer (instb) 11, general purpose registers (General Purpose Registers) r0 to r15, an arithmetic unit (ALU: Arithmetic Logic Unit) 12, an entrance / exit to the connection network between PEs (Left / Right Inter PE Connection), and a local memory 20 for each PE. The PE 10 further includes registers stop and mode and a selector sel. The components added relative to FIG. 10 (the thick-line portions in FIG. 1), namely the registers stop and mode, the selector sel, and the registers cm, sx, and sy, are described below.

The register stop is a control register for keeping the operation of the PE 10 stopped during the instruction / data distribution operation.

The register mode is a 1-bit operation mode selection register for switching between a systolic operation and a conventional SIMD operation.

The selector sel is a selector that selects whether or not an instruction issued from the CP is stored in the instruction buffer (instb) 11. The registers cm, sx, and sy are a general-purpose register (waiting register) group having a data waiting function in which a predetermined counter register is decremented each time a write to the register occurs during a systolic operation.

FIG. 2 is a diagram schematically showing the configuration of the control processor (CP) 30 included in the SIMD processor according to the present embodiment. Referring to FIG. 2, like the CP 130 in the conventional SIMD processor, the CP 30 has a data path for performing its own operations and is connected to the global memory 40 via an instruction / data cache (Instruction / Data Cache) 31 and a memory access arbitration unit (Arbiter) 33. The CP 30 reads from the global memory 40, and issues, the instructions to be executed in its own data path and the instructions to be broadcast to the entire PE array. It also reads from, or writes to, the global memory 40 the operation data on the CP 30 and the data exchanged with the local memory 20 of the PEs 10.

FIG. 3 is a diagram illustrating an example of an instruction format in the SIMD processor according to the present embodiment. Unlike the CP 130 in the conventional SIMD processor, the CP 30 in the SIMD processor of the present embodiment uses an instruction set with the instruction format shown in FIG. 3 so that it can send an instruction and / or data to a specific PE, and cause a specific PE or a plurality of designated PEs to execute.

Referring to FIG. 3, the instruction format of CP30 has a different format according to the bit pattern of the header section of the instruction.

When the bit pattern of the “header” part is “00”, it indicates that the instruction operates on the CP 30.

When the bit pattern of the “header” portion is “01”, this indicates a PE instruction that specifies the operation of the PE array when operating in the SIMD mode.

When the bit pattern of the “header” part is “10”, the instruction is an instruction A to be distributed to a specific PE 10, and it is always followed by an instruction B whose “header” part is “11”. The pair specifies the operation of writing the instruction A to the instruction buffer (instb) 11 of the PE whose number is indicated by the “Target PEID” part of the instruction B, and of writing the “data” part of the instruction B to the register of that PE indicated by the “Target reg ID” part of the instruction B.

When the bit pattern of the “header” part is “11”, the instruction specifies the operation of writing the value of the “data” part to the register indicated by the “Target reg ID” part in the PE whose number is indicated by the “Target PEID” part of the instruction or, when the “Target reg ID” part is cm, of writing the value of the “data” part to the register cm of every PE 10 whose instruction stored in the instruction buffer (instb) 11 has cm as a source operand.
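The four header patterns described above can be summarized in a short sketch. This is only an illustration of the text, not the actual bit layout of FIG. 3: the names `Kind`, `classify`, and `check_pairing` are hypothetical, and headers are represented as two-character strings for readability.

```python
from enum import Enum


class Kind(Enum):
    """Instruction classes selected by the 2-bit header (names are illustrative)."""
    CP_INSTRUCTION = "00"        # executed on the CP itself
    SIMD_PE_INSTRUCTION = "01"   # broadcast to the whole PE array in SIMD mode
    DISTRIBUTE = "10"            # instruction A, stored in one PE's instb
    REGISTER_WRITE = "11"        # instruction B, or a register write to a PE


def classify(header: str) -> Kind:
    """Map a header bit pattern to the instruction class described in the text."""
    return Kind(header)


def check_pairing(headers):
    """Return True if every "10" instruction is immediately followed by "11"."""
    for i, h in enumerate(headers):
        if h == "10" and (i + 1 >= len(headers) or headers[i + 1] != "11"):
            return False
    return True
```

A stream such as `["00", "10", "11", "11"]` passes `check_pairing`, whereas `["10", "01"]` does not, because the distributed instruction A lacks its companion instruction B.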

FIG. 4 shows the overall configuration of the PE array in the SIMD processor according to this embodiment. Referring to FIG. 4, the bold-line portions indicate the circuit elements newly added in the present invention relative to the conventional SIMD processor: besides the additions inside the individual PEs 10, a global access arbitration unit (Global Access Arbiter) 50.

The global access arbitration unit 50 is a module that manages the local memories 20 of all the PEs 10 so that they can be used together as the body of a multi-bank cache memory, and that arbitrates memory accesses when they are generated simultaneously from a large number of PEs 10 in the systolic operation mode.

The following implementation methods are conceivable for the global access arbitration unit 50. One method, with low performance but low hardware cost, is to temporarily stop the operation of the entire PE array when memory access requests arrive from two or more PEs simultaneously, serve the requests of the PEs one by one, and then resume the operation of the PE array. Another method, with the highest performance but high hardware cost, is to connect the local memory blocks of all PEs by a crossbar mechanism so as to respond to a large number of memory access requests with the shortest possible delay. Any implementation method may be adopted as long as the maximum memory access delay can be determined statically when arbitrary program code is executed in an overlapped manner.
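As a rough illustration of the low-cost arbitration scheme, the sketch below counts how many cycles the PE array would be stalled for a given request pattern. The cost model is an assumption made here, not something the text specifies: serving k simultaneous requests (k of 2 or more) is taken to stall the array for k cycles, and a lone request causes no stall. The function name `stall_cycles` is hypothetical.

```python
def stall_cycles(requests_per_cycle):
    """Total stall cycles under the assumed low-cost arbitration model.

    requests_per_cycle[t] lists the PE numbers that request global memory
    access in cycle t. Assumption: when two or more PEs request in the same
    cycle, the whole array stops while the requests are served one per cycle.
    """
    return sum(len(reqs) for reqs in requests_per_cycle if len(reqs) >= 2)


# PE1 alone, then PE1 and PE4 together, an idle cycle, then three PEs at once.
pattern = [[1], [1, 4], [], [0, 2, 5]]
total = stall_cycles(pattern)
```

Under this model, staggering memory access instructions across PEs (so that each cycle has at most one request) keeps the stall count at zero, which is why the simple scheme can still be adequate.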

Next, the operation of the entire SIMD processor of the embodiment will be described. First, in each cycle, the CP 30 simultaneously reads two instructions: the instruction at the address indicated by the value of the program counter (PC: Program Counter) 35 and the instruction at the next address. How the count value of the PC 35 is updated each cycle, and how the read instructions are processed, are determined as follows according to the value of the “header” part of the first instruction A that was read.

When the value of the “header” part of the first instruction A is “00”, the instruction is a CP instruction: the count value of the PC 35 is incremented by 1 by the next cycle, and the instruction A is executed in the processing section of the CP 30.

When the value of the “header” part of the first instruction A is “01”, the instruction is a PE instruction: the count value of the PC 35 is incremented by 1, the instruction A is delivered to the PE array by the next cycle, and all PEs 10 execute the instruction A in the next cycle.

When the value of the “header” part of the first instruction A is “10”, the instruction / data distribution operation is designated: the count value of the PC 35 is incremented by 2, and by the next cycle the register stop of all PEs 10 is set to ON and the instruction A, after the 2 bits of its “counter” part have been copied into its “header” part, is written to the instruction buffer (instb) 11 of the PE whose number is designated by the “Target PEID” part of the subsequent instruction B. Likewise, by the next cycle, the value of the “data” part of the instruction B is written to the register of that PE specified by the “Target reg ID” part of the instruction B.

Finally, when the value of the “header” part of the first instruction A is “11”, the systolic operation is designated: the count value of the PC 35 is incremented by 1, and by the next cycle the register stop and the register mode of all PEs are set to OFF and ON, respectively, and the value of the “data” part of the instruction A is written to the register specified by the “Target reg ID” part of the instruction A in the PE whose number is specified by the “Target PEID” part of the instruction A.
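The per-cycle behavior of the CP described above can be sketched as a small dispatch function. The dictionary layout of an instruction and the names `cp_step` and `run` are hypothetical stand-ins for the bit fields of FIG. 3; the sketch follows the header assignment given with the instruction format, in which “11” designates the systolic register write.

```python
def cp_step(program, pc):
    """Return (next_pc, action) for the instruction at `pc`.

    `program` is a list of dicts with at least a "header" key. This dict
    layout is an assumption used only for illustration.
    """
    a = program[pc]
    header = a["header"]
    if header == "00":                       # CP instruction
        return pc + 1, ("cp_execute", a)
    if header == "01":                       # SIMD instruction for all PEs
        return pc + 1, ("simd_broadcast", a)
    if header == "10":                       # instruction/data distribution
        b = program[pc + 1]                  # companion instruction B ("11")
        return pc + 2, ("distribute", a, b["target_peid"],
                        b["target_reg"], b["data"])
    if header == "11":                       # systolic operation
        return pc + 1, ("systolic_write", a["target_peid"],
                        a["target_reg"], a["data"])
    raise ValueError(f"unknown header {header!r}")


def run(program):
    """Step through the whole program and record the kind of each action."""
    pc, trace = 0, []
    while pc < len(program):
        pc, action = cp_step(program, pc)
        trace.append(action[0])
    return trace


program = [
    {"header": "00"},
    {"header": "10"},
    {"header": "11", "target_peid": 0, "target_reg": 0, "data": 100},
    {"header": "11", "target_peid": 0, "target_reg": "cm", "data": 1},
    {"header": "01"},
]
trace = run(program)
```

Note that the “11” instruction at index 2 is consumed as part of the distribution pair (the PC advances by 2), while the “11” instruction at index 3 is executed on its own as a systolic write.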

Referring to FIG. 3, an instruction having a “header” value of “10” includes a “counter” portion. It is assumed that a value equal to or greater than the number of “waiting registers” among the source operands of the instruction is set in the “counter” portion.

On the PE side, every time a write operation occurs to any “waiting register” designated as the source operand of the instruction C stored in the instruction buffer (instb) 11 while the register mode is ON, The “counter” part of instruction C is decremented by one. Then, the execution of the instruction C is started in the cycle in which the “counter” part is decremented to zero. Execution of instruction C is completed in one cycle, and the execution result is written to the register specified by the “Destination reg ID” part of instruction C included in the PE of the PE number specified by the “PEID” part of instruction C. In addition, the execution result value is sent out on the PE coupling line by the next cycle. Simultaneously with the end of execution of the instruction C, the value of the “header” part of the instruction C is copied to the “counter” part of the instruction C.

For example, when the CP 30 executes, in the systolic operation, an instruction A whose “Target reg ID” part specifies the waiting register sx, a write operation P to the register sx occurs on the PE whose number is specified by the “Target PEID” part of the instruction A. Suppose that the value “1” is set in the “counter” part of the instruction C in the instruction buffer (instb) 11 of that PE, and that the register sx is the only “waiting register” among the source operands of the instruction C. Then, as a result of the write operation P to the register sx, the PE starts executing the instruction C, and at the same time as the execution of the instruction C ends, the value 1 is set again in the “counter” part of the instruction C. However, if another write operation to the waiting register sx of the PE occurs in the same cycle as the end of the execution of the instruction C, the “counter” part is set not to 1 but to 0, the result of the decrement, so the instruction C is executed again on the PE without interruption.

As another example, when the CP 30 executes, in the systolic operation, an instruction A whose “Target reg ID” part specifies the waiting register cm, a write operation P to cm occurs on every PE whose instruction in the instruction buffer (instb) 11 has cm as a source operand. Suppose that the value “2” is set in the “counter” part of the instruction C in the instruction buffer (instb) 11 of such a PE, and that cm and sy are both source operands of the instruction C. Then the write operation P to cm caused by the CP's execution of the instruction A is not by itself enough for the PE to start executing the instruction C. When a write operation Q to the register sy also occurs, the PE starts executing the instruction C. At the same time as the execution of the instruction C ends, the value 2 is copied from the “header” part of the instruction C and set again in the “counter” part of the instruction C. If write operations to both waiting registers sy and cm of the PE occur in that same cycle, the “counter” part is instead set to 0, the result of decrementing twice, so the instruction C is executed again on the PE without interruption.
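The waiting-register mechanism in the two examples above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the circuit itself: the class `WaitingPE` is hypothetical, execution is modeled as taking place in the cycle after the counter reaches zero, and writes that arrive in the execution cycle are subtracted from the reloaded counter. Clamping the counter at zero is also an assumption, since the text does not describe surplus writes.

```python
class WaitingPE:
    """One PE holding one instruction that waits on register writes."""

    def __init__(self, waiting_regs, counter_init):
        self.waiting_regs = set(waiting_regs)  # e.g. {"cm", "sy"}
        self.counter_init = counter_init       # copy kept in the "header" part
        self.counter = counter_init
        self.executions = 0

    def cycle(self, writes=()):
        """Advance one cycle; return True if the instruction executes in it."""
        hits = sum(1 for reg in writes if reg in self.waiting_regs)
        fired = self.counter == 0
        if fired:
            self.executions += 1
            # Reload from the header copy, then apply this cycle's writes.
            self.counter = self.counter_init - hits
        else:
            self.counter -= hits
        self.counter = max(self.counter, 0)    # assumption: clamp at zero
        return fired


# One waiting register: a single write triggers a single execution.
single = WaitingPE({"sx"}, 1)
single_trace = [single.cycle(["sx"]), single.cycle(), single.cycle()]

# Two waiting registers: both writes are needed before execution starts.
double = WaitingPE({"cm", "sy"}, 2)
double_trace = [double.cycle(["cm"]), double.cycle(["sy"]), double.cycle()]

# A write in every cycle keeps the instruction executing back to back.
stream = WaitingPE({"sx"}, 1)
stream_trace = [stream.cycle(["sx"]) for _ in range(4)]
```

The third case is the one that gives the pipeline its throughput of one iteration per cycle: because the reload and the decrement happen together, the counter returns to zero immediately and the PE never idles while operands keep arriving.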

As can be seen from the above two operation examples, by issuing a “systolic operation” instruction, the CP 30 can cause a PE to execute the instruction stored in its instruction buffer (instb) 11. If the instruction stored in the instruction buffer (instb) 11 designates a “waiting register” of another PE as the write destination of its execution result, the execution of instructions propagates from PE to PE. Further, since the CP 30 can write to the registers cm of a large number of PEs at once, it can shift a large number of PEs from the “waiting” state to the “execution” state simultaneously. In this manner, the CP 30 can trigger a systolic chain of instruction executions on the PE array by issuing a “systolic operation” instruction.

Next, the operation of the SIMD processor according to the embodiment will be described using a specific example. FIG. 5(a) shows pseudo code corresponding to the loop portion that is mapped to the PE array in this example. Referring to FIG. 5(a), the pseudo code reads data from the array A, adds the variable a, and writes the result to the array B; it is executed as a total of 8 iterations, one for each of elements 0 to 7 of the array A.

FIG. 5(b) shows the instructions assigned to each PE when the processing of FIG. 5(a) is mapped to PE0, PE1, PE2, PE4, and PE10 in the PE array of the SIMD processor of this embodiment.

Here, add shown in FIG. 5(b) denotes an addition instruction, which has two source operand designations (A and cm) and one destination register designation (1sx). A single letter (A, B, a) represents a constant (here, A is the absolute address of the array A). When the instruction stored in the instruction buffer (instb) 11 is executed, a constant specified as an operand is assumed to be stored in the register with register number 0 (that is, r0), and the value is read from the register r0. The destination register is specified by a combination of a PE number and a register name. For example, 1sx means that the operation result is stored in the sx register of the PE with PE number 1.

Referring to FIGS. 1 and 3, in order for the CP 30 to store the instruction “add A, cm, 1sx” in the instruction buffer (instb) 11 of PE0, it suffices to prepare the following pair of instructions and have the CP 30 execute them. The first is an instruction whose “header” part is 10, whose “opcode” part is the bit string representing the add instruction, whose “1st operating reg ID” part is 0, whose “2nd operating reg ID” part is 0xd, whose “Destination reg ID” part is 0xe, whose “PEID” part is 1 (because the PE number of the operation result storage destination is 1), and whose “counter” part is 1. The second is an instruction whose “header” part is 11, whose “data” part is the absolute address of the array A, whose “Target reg ID” part is 0, and whose “Target PEID” part is 0 (because the PE number of the owner of the instb in which the add instruction is stored is 0).
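The instruction pair for “add A, cm, 1sx” can be written out as plain records to make the field values concrete. The sketch below uses dictionaries rather than packed bits, since the field widths of FIG. 3 are not given in the text; the key names, the `distribute` helper, and the address `ADDR_A` are all hypothetical. Copying the counter into the stored header follows the description of the distribution operation.

```python
ADDR_A = 0x1000  # hypothetical absolute address of array A

# Instruction A: the add instruction to be stored in PE0's instb.
instruction_a = {
    "header": 0b10,
    "opcode": "add",       # stands in for the add bit string
    "src1_reg_id": 0x0,    # constant operand A, read from r0
    "src2_reg_id": 0xD,    # cm
    "dest_reg_id": 0xE,    # sx
    "peid": 1,             # the result goes to PE1
    "counter": 1,          # one waiting register (cm) among the sources
}

# Instruction B: names the target PE and supplies the constant for r0.
instruction_b = {
    "header": 0b11,
    "data": ADDR_A,
    "target_reg_id": 0x0,
    "target_peid": 0,
}


def distribute(pes, a, b):
    """Store A in the target PE's instb and B's data in the target register."""
    pe = pes[b["target_peid"]]
    # The counter value is copied into the header field of the stored copy,
    # so that it can later be used to reload the counter.
    pe["instb"] = dict(a, header=a["counter"])
    pe["regs"][b["target_reg_id"]] = b["data"]


pes = {0: {"instb": None, "regs": [0] * 16}}
distribute(pes, instruction_a, instruction_b)
```

After the call, PE0 holds the add instruction with r0 preloaded with the address of A, and it waits for a write to cm before executing.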

In addition, gld and gst shown in FIG. 5(b) are a load instruction and a store instruction for the global memory, respectively. The load instruction has the load target address as its first source operand, has no second source operand, and designates the destination register in which the loaded data is stored. The store instruction has the store target address as its first source operand and the register holding the write data as its second source operand, and has no destination register designation (indicated as NULL in FIG. 5(b)). Accordingly, in FIG. 5(b), the destination fields 1sx, 2sx, 4sx, 4sy, and NULL indicate sx of PE1, sx of PE2, sx of PE4, sy of PE4, and no destination, respectively.
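To make the data flow between the five PEs concrete, the following sketch traces one iteration functionally (values only, no cycle timing). FIG. 5(b) is not reproduced in the text, so the operand lists are a reconstruction from the destination fields 1sx, 2sx, 4sx, 4sy, and NULL; only “add A, cm, 1sx” on PE0 is stated explicitly, and the other four instructions are assumptions. Word addressing and the memory layout are likewise illustrative.

```python
def run_iteration(i, mem, A, B, a):
    """Follow one iteration's values through the five assumed nodes.

    `i` plays the role of the value written to cm by the CP.
    """
    pe1_sx = A + i          # PE0:  add A, cm  -> 1sx  (address of A[i])
    pe4_sx = B + i          # PE10: add B, cm  -> 4sx  (address of B[i], assumed)
    pe2_sx = mem[pe1_sx]    # PE1:  gld sx     -> 2sx  (load A[i], assumed)
    pe4_sy = pe2_sx + a     # PE2:  add sx, a  -> 4sy  (A[i] + a, assumed)
    mem[pe4_sx] = pe4_sy    # PE4:  gst sx, sy -> NULL (store B[i], assumed)


# Array A occupies addresses 0..7 and array B addresses 8..15 (illustrative).
mem = list(range(10, 18)) + [0] * 8
for i in range(8):          # one activation per iteration, i = 0..7
    run_iteration(i, mem, A=0, B=8, a=5)
```

Because each node depends only on values produced earlier in the same iteration, successive iterations can overlap in time on the PE array, which is what the time chart of the example illustrates.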

FIG. 5 (c) shows, as an example, a time chart of the operation from when the instruction code of FIG. 5 (b) is distributed to the PE array by the CP 30 until the operation ends. Referring to FIG. 5C, the vertical axis represents time (unit: cycle), and the horizontal axis represents the operation of the CP 30 and the operation on the PE side. The operation on the PE side is displayed separately for each iteration. In the column of FIG. 5C, the operation status of the CP 30 and PE in each cycle is shown.

For example, INSTB_BC (PE0), written at the top of the column indicating the operation of the CP 30, indicates that the operation of reading an instruction whose “header” part is “10” (together with the subsequent instruction whose “header” part is “11”) and distributing it to PE0 occurred in that cycle. GO (1, cm) indicates that an instruction whose “header” part is “11” and whose “data” part is “1” was read, and that the value 1 of its “data” part was written in that cycle to the register cm of each PE, among PE0, PE1, PE2, PE4, and PE10, whose instruction in the instruction buffer (instb) 11 includes the register cm as a source operand. CLD and CST represent the cycles in which a load operation and a store operation to the global memory, via the cache memory or the like, were started as a result of arbitration by the global access arbitration unit 50 following the issue of the gld and gst instructions on the PE side, respectively. PEx represents a cycle in which the PE with PE number x executes an instruction. In particular, PEx/y indicates that PEx and PEy executed instructions in the same cycle.

In addition, in order to make it easy to understand the timing correspondence between CLD and CST on CP30, * or + is added to the end of PEx that executed the gld or gst instruction. As an example, a dotted arrow indicates a flow from when the gld instruction is executed on PE1 until load data is sent to PE2. Further, a black rectangle in FIG. 5C represents a load data waiting cycle, and “-” represents a transfer cycle between PEs.

FIG. 6A shows the usage status of the inter-PE connection network in each cycle in the i = 0 iteration when the SIMD processor of the present invention has only a bidirectional one-dimensional adjacent PE connection network. Here, the filled black circle indicates that the instruction is executed on the corresponding PE (horizontal axis). An arrow PEx → PEy indicates that data transfer has occurred between PEx and PEy. Here, it is assumed that the load access delay for the global memory 40 is three cycles. Therefore, the arrow from PE1 to PE2 extends over 3 cycles.

FIG. 6 (b) shows the usage status of the PE network for each cycle in a total of 8 iterations from i = 0 to i = 7. Referring to FIG. 6B, no crossing occurs between a plurality of arrows in the same direction. This indicates that there is no collision regarding data transfer using the coupling line between PEs over a total of 21 cycles in which the PE array performs systolic operation. In FIG. 6B, the arrow in the left → right (or left ← right) direction from PEx to PEy in cycle P indicates the connection between adjacent PEs in the direction of PEx → PEy (or PEx ← PEy) in the cycle. Indicates that data transfer is performed using a line. Also, in FIG. 6B, the brightness of the arrows and filled circles is changed in order to make it easy to distinguish between individual iterations.

FIGS. 5C and 6 assume that the delay of a load access by a PE to the global memory is 3 cycles. When the delay is smaller than 3, for example 2, the instruction assigned to PE10 may instead be assigned to PE9, as shown in FIG. When the load access delay is larger than 3, for example 4, the instruction assigned to PE10 may instead be assigned to PE11, as shown in FIG. 7B.

Next, the effects of this example of the SIMD processor according to the embodiment will be described. FIG. 8 shows pseudo code for the case where the program code of FIG. 5A is executed sequentially on the CP 30. In FIG. 8, CADD represents an add instruction, and CLD and CST represent a memory load instruction and a memory store instruction, respectively. These are all instructions whose “header” part is “00”.

When the program code of FIG. 5A is executed using only the CP 30, approximately 6 cycles are required per iteration, so a total of about 48 cycles is required for 8 iterations. In contrast, referring to FIG. 6B, the processing completes in about 21 cycles when the SIMD processor according to the embodiment is used. Therefore, according to the present invention, a speedup of roughly a factor of two or more can be achieved.
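The comparison for 8 iterations can be written out as a short worked calculation. The symbols T_CP and T_PE are introduced here only for illustration and do not appear in the specification.

```latex
\begin{align*}
T_{\mathrm{CP}} &\approx 6 \times 8 = 48 \ \text{cycles}, \\
T_{\mathrm{PE}} &\approx 21 \ \text{cycles}, \\
\frac{T_{\mathrm{CP}}}{T_{\mathrm{PE}}} &\approx \frac{48}{21} \approx 2.3 .
\end{align*}
```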

In this embodiment, a total of 5 cycles is required to distribute the instructions to the PE array, so the performance improvement is small when the number of iterations is small. However, if the number of iterations is, for example, 1000, the 5 cycles required to distribute the instructions to the PE array become negligible. Moreover, in this embodiment, the iterations can be executed with a throughput of one iteration per cycle. In contrast, referring to FIG. 8, when the same processing is executed on the CP 30, 6 cycles are required per iteration. Therefore, the SIMD processor of the present invention provides a performance improvement of about 6 times.
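How the speedup approaches 6 as the iteration count grows can be illustrated with a simple cycle-count model. This Python sketch is not part of the specification. The 6 cycles per iteration on the CP, the 5 distribution cycles, and the throughput of one iteration per cycle come from the text. The pipeline fill of 13 cycles is an assumption, inferred from the roughly 21 systolic cycles quoted for 8 iterations and from the further assumption that those 21 cycles exclude the distribution cycles. All function and constant names are hypothetical.

```python
CP_CYCLES_PER_ITERATION = 6  # from the text: sequential execution on the CP
DISTRIBUTION_CYCLES = 5      # from the text: distributing instructions to the PE array
# Assumption: about 21 systolic cycles for 8 iterations at one iteration
# per cycle implies 21 - 8 = 13 cycles of pipeline fill.
PIPELINE_FILL_CYCLES = 21 - 8


def cp_cycles(iterations):
    """Cycles when the loop runs sequentially on the CP alone."""
    return CP_CYCLES_PER_ITERATION * iterations


def pe_array_cycles(iterations):
    """Cycles on the PE array: distribution, pipeline fill, then one iteration per cycle."""
    return DISTRIBUTION_CYCLES + PIPELINE_FILL_CYCLES + iterations


def speedup(iterations):
    """Ratio of CP-only cycles to PE-array cycles under this model."""
    return cp_cycles(iterations) / pe_array_cycles(iterations)


if __name__ == "__main__":
    for n in (8, 100, 1000, 100000):
        print(n, cp_cycles(n), pe_array_cycles(n), round(speedup(n), 2))
```

Under these assumptions the fixed overhead of distribution and pipeline fill dominates for small iteration counts and becomes negligible for large ones, so the modeled speedup rises toward the factor of 6 stated in the text.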

It should be noted that the disclosures of the above patent documents and non-patent documents are incorporated herein by reference. Within the scope of the entire disclosure (including claims) of the present invention, the embodiments and examples can be changed and adjusted based on the basic technical concept. Various combinations and selections of various disclosed elements are possible within the scope of the claims of the present invention. That is, the present invention of course includes various variations and modifications that could be made by those skilled in the art according to the entire disclosure including the claims and the technical idea.

10, 110, PE0 to PE14 Processing element (PE)
11, 111 Instruction buffer (instb)
12, 32, 112 Arithmetic unit (ALU)
20, 120 Local memory
30, 130 Control processor (CP)
31 Instruction / data cache
33 Arbiter
35 Program counter (PC)
40, 140 Global memory
50 Global access arbiter
cm, mode, r0 to r15, stop, sx, xy Registers
sel Selector

Claims

1. A SIMD (Single Instruction Multiple Data) processor comprising a control processor (CP) and a plurality of processing elements (PEs), wherein:
the plurality of PEs perform a SIMD operation of executing a single instruction issued from the CP;
the CP performs an instruction/data distribution operation of distributing different instructions, or different instructions and data, to each of the plurality of PEs; and
each of the plurality of PEs performs a systolic operation of executing the instruction sent from the CP in the instruction/data distribution operation after the source operands of that instruction have become available.

2. The SIMD processor according to claim 1, wherein, in the instruction/data distribution operation, the CP sequentially and exclusively uses the instruction issue path to all PEs in order to distribute instructions to each PE.

3. The SIMD processor according to claim 1, wherein, in the systolic operation, each of the plurality of PEs transfers the result of an executed operation instruction to another PE, and uses an operation result transferred from another PE as a source operand of the instruction sent in the instruction/data distribution operation.

4. The SIMD processor according to claim 1, wherein, in the systolic operation, each of the plurality of PEs uses data sent from the CP as a source operand of the instruction sent in the instruction/data distribution operation.

5. The SIMD processor according to any one of the preceding claims, further comprising a global access arbitration unit that arbitrates accesses to the global memory by the plurality of PEs in the systolic operation and guarantees exclusive access to the global memory.

6. The SIMD processor described above, wherein each of the plurality of PEs includes a register that stores a flag for keeping its operation stopped during the instruction/data distribution operation.

7. The SIMD processor according to claim 1, wherein each of the plurality of PEs includes a register that stores a flag for switching its operation between the SIMD operation and the systolic operation.

8. The SIMD processor according to claim 1, wherein each of the plurality of PEs includes a selector for selecting whether or not to store an instruction issued from the CP in its own instruction buffer.