US20120147016A1 - Image processing device and image processing method - Google Patents

Image processing device and image processing method

Info

Publication number
US20120147016A1
Authority
US
United States
Prior art keywords
image processing
processing
parallel
image
cpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/392,510
Inventor
Masatoshi Ishikawa
Takashi Komuro
Tomohira Tabata
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Tokyo NUC
Original Assignee
University of Tokyo NUC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Tokyo NUC filed Critical University of Tokyo NUC
Assigned to THE UNIVERSITY OF TOKYO. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KOMURO, TAKASHI; TABATA, TOMOHIRA; ISHIKAWA, MASATOSHI
Publication of US20120147016A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 — General purpose image data processing
    • G06T1/20 — Processor architectures; Processor configuration, e.g. pipelining

Definitions

  • Since the parallel memories 121 . . . are constituted by dual port RAM, as in this embodiment, reading and writing can be carried out independently. As a result, even when the output destination of one process is the input source of the next process, if it can be confirmed that the subsequent process does not overtake the preceding one, the next process can be started before all of the preceding processing is complete.
  • A block diagram of the developed system is shown in FIG. 3. The system is implemented as a single main board on which two sub boards are mounted.
  • FPGAs, memory, an I/O port, etc. are mounted on the main board; the sub boards are a CPU board and a camera interface board. The correspondence between the elements of the previously described embodiment and the hardware of this practical example is shown below.
  • Each FPGA is connected to the CPU bus by means of an expansion bus. Functionally, this expansion bus therefore doubles as both the inter-coprocessor bus and the CPU bus.
  • The PBSRAM in FIG. 3 is not shown in FIG. 1; it is external memory for each FPGA.
  • The CPU board is a commercially available board, the ESPT-Giga (trade name), and is connected to the FPGAs on the main board through the expansion bus.
  • The ESPT-Giga has a Renesas SH7763 (SH-4A, 266 MHz) as its CPU and 64 MB of DDR-SDRAM as memory, and provides 10/100/1000BASE Ethernet (registered trademark), USB 1.1, and RS232C for input/output.
  • The ESPT-Giga can run a built-in Web server, so the system can be operated, and processing results displayed, from a PC through a web browser. Remote management over a LAN thus becomes possible, and a plurality of systems can be managed from a single PC.
  • Flash memories (8 Mbytes) storing the configuration data of the FPGAs are also provided, one for each FPGA.
  • The FPGAs have respective frame memories (DRAM), and an input image from the camera is automatically stored in the frame memory of FPGA1.
  • A camera interface is connected to FPGA1.
  • The physical interface for an individual camera is implemented on a camera interface board attached to the main board.
  • The camera interface supports the Basler A504k (monochrome)/A504kc (color) and the Mikrotron EoSens MC1362 (monochrome)/MC1363 (color).
  • These cameras are capable of real time output of images of up to 1280 × 1024 pixels at 500 fps.
  • The A504k/kc and MC1362/1363 adopt a specially extended version of the CameraLink interface standard, and are connected to the board with two CameraLink cables. These cameras are compatible at the physical layer with cameras having a normal CameraLink interface, so other cameras can also be handled by changing the circuitry of FPGA1. Further, with this practical example other camera interfaces, such as IEEE1394 or Gigabit Ethernet (registered trademark), can be handled by changing the camera interface board.
  • An analog VGA port is connected to FPGA1, making it possible to output images stored in the frame memory to a display at SXGA size (1280 × 1024).
  • Each FPGA has a small capacity SRAM that is separate from the frame memory.
  • This SRAM can be used as an input source for the coordinate transform processing described later.
  • Each coprocessor has the units described below.
  • One pixel is processed as 16 bits. Images sent from a camera are most often 8 bit, but 16 bits is made the standard because greater precision is needed during calculation: for example, the result of adding or subtracting 8-bit images is 9 bits, and summing many images with weighting, as in filter processing, needs even more bits.
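  • As a minimal illustration of this bit growth (plain C with hypothetical buffer names, not taken from the patent), subtracting two 8-bit images needs a signed 9-bit range, which a 16-bit pixel type holds without clipping:

    #include <stdint.h>
    #include <stddef.h>

    /* Difference of two 8-bit images: results span -255..255 (9 bits
       with sign), so a 16-bit pixel type holds them without clipping. */
    void image_sub(const uint8_t *a, const uint8_t *b, int16_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = (int16_t)a[i] - (int16_t)b[i];
    }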
  • Color images are handled as three independent grayscale images, one each for R, G, and B.
  • For a processing unit, the handling of 16-bit input data and of 16-bit output data can each be designated, as follows. T_l and T_h represent the lower and upper limits of an appropriately set threshold.
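  • A plausible software model of the thresholded input handling (a sketch; the exact menu of input/output modes is not reproduced here) is:

    #include <stdint.h>

    /* Binarize a 16-bit pixel against the threshold window [t_l, t_h]. */
    static inline uint16_t binarize(int16_t x, int16_t t_l, int16_t t_h)
    {
        return (x >= t_l && x <= t_h) ? 1u : 0u;
    }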
  • Coefficient parameters for image processing are 16 bit or 12 bit signed fixed point values, and the position of the decimal point is designated in common for the parameters.
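  • As a sketch of how such a signed fixed-point parameter might be produced and applied in software (the concrete bit assignment here is an assumption):

    #include <stdint.h>

    /* Convert a real coefficient to 16-bit signed fixed point with f
       fractional bits; f is shared by all parameters of one operation. */
    static inline int16_t to_fixed(double c, int f)
    {
        return (int16_t)(c * (double)(1 << f));
    }

    /* Multiply a pixel by a fixed-point coefficient and rescale. */
    static inline int16_t mul_fixed(int16_t pixel, int16_t coeff, int f)
    {
        return (int16_t)(((int32_t)pixel * (int32_t)coeff) >> f);
    }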
  • The parallel memories can simultaneously read and write data 128 bits (8 pixels) at a time. They are constituted by dual port RAM, so reading and writing can be carried out independently.
  • The DMA control units (DMACs) carry out transfer of data between the memories.
  • Data transfer with the CPU is only possible through a specific DMAC in each FPGA (for example, DMA2).
  • Likewise, the only device able to transfer data to another FPGA is another specific DMAC in each FPGA (for example, DMA1).
  • Data transfer between the memories is carried out in 128 bit units, but when transferring data to or from an external memory, the transfer is limited by the operating speed of the external memory.
  • Each DMA control unit includes a shift circuit that outputs data shifted to the left in byte units, 16 bytes at a time.
  • The src address for data is limited to multiples of 16, but by using the shift circuit it is possible to make data at an arbitrary address the src.
  • A thinning circuit receives data input 16 bytes at a time and outputs data thinned by 8:1 (1/8 of the data amount), 4:1 (1/4 of the data amount), or 2:1 (1/2 of the data amount). Image reduction can be carried out by using this function together with designation of the DMA transfer address increment.
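  • A software model of the thinning function (a sketch; the hardware operates on 16-byte units per clock, which is abstracted away here):

    #include <stdint.h>
    #include <stddef.h>

    /* Keep every factor-th pixel (factor = 2, 4 or 8); combined with
       the DMA address increment this realizes image reduction. */
    void thin(const uint16_t *src, uint16_t *dst, size_t n, int factor)
    {
        for (size_t i = 0; i < n / (size_t)factor; i++)
            dst[i] = src[i * (size_t)factor];
    }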
  • Table 2 shows calculation units (that is, processing units) implemented in each coprocessor.
  • AFFINE can receive input from the SRAM.
  • SCALE, ARITH and SUM can process 8 pixels simultaneously in 1 clock, while 3 × 3 CONV can process 4 pixels simultaneously in 1 clock.
  • AFFINE processes only 1 pixel in 1 clock.
  • One instruction is made up of 1 to 3 words depending on the number of parameters.
  • A single instruction corresponds to a single proc_X( ) function as previously described, and can instruct processing on image data of a designated range using a single DMA control unit or image processing unit.
  • An instruction to do nothing at all is also provided, and this corresponds to the sync( ) function.
  • The FPGAs used in this practical example operate at 200 MHz; resource utilization is 88% for FPGA1 and 81% for FPGA2.
  • Table 3 shows computing times for processing that uses the basic functions of the system of this practical example, and for processing that combines basic functions. For comparison, the computing time when the same processing is implemented on a PC using OpenCV is also shown.
  • The PC used had an Intel E6300 (1.86 GHz × 2) CPU and 3 GB of RAM; the processing was implemented with Visual Studio 2005 and OpenCV 1.0 and measured on Windows (registered trademark) XP.
  • In the table, EvalSys shows the processing time when using the developed evaluation system, and OpenCV shows the processing time when using the PC and OpenCV.
  • The input source and output destination are set to parallel memory within the FPGA, and the image size is set to 256 × 32.
  • Centroid computation first extracts from the input image a region in which the subject is predicted to exist, based on the results of the previous frame, and binarizes it with a fixed or adaptively determined threshold value. Next, the centroid is computed using the following equations.
  • $m_{00} = \sum_{x,y} I(x,y)$  (1)
  • $m_{10} = \sum_{x,y} x I(x,y)$, $m_{01} = \sum_{x,y} y I(x,y)$  (2)
  • $x_c = m_{10}/m_{00}$, $y_c = m_{01}/m_{00}$  (3)
  • Weights $I_x$ and $I_y$, holding the x and y coordinate values, are loaded into parallel memory in advance; the binarized input image is weighted using ARITH, and the moments $m_{10}$ and $m_{01}$ are computed by summation calculation using SUM. The moment $m_{00}$ is obtained by summation calculation without weighting.
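  • A reference implementation of equations (1)-(3) in plain C (row-major array layout is an assumption; this models the result, not the hardware datapath):

    #include <stdint.h>

    /* Compute the centroid (xc, yc) of a binarized image I (values 0/1)
       from the moments m00, m10, m01 of equations (1)-(3). */
    int centroid(const uint16_t *I, int w, int h, double *xc, double *yc)
    {
        uint64_t m00 = 0, m10 = 0, m01 = 0;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                uint16_t v = I[y * w + x];
                m00 += v;
                m10 += (uint64_t)x * v;
                m01 += (uint64_t)y * v;
            }
        if (m00 == 0)
            return -1;              /* empty region: centroid undefined */
        *xc = (double)m10 / (double)m00;
        *yc = (double)m01 / (double)m00;
        return 0;
    }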
  • The developed board operates from a 12 V, 5 A power supply, and effective power consumption is about 42 W. Because the power consumption of FPGAs is comparatively high, power consumption is high compared to the case of using a DSP or the like, but this practical example has the advantage that, by using an embedded system, stability and reliability can be ensured.

Abstract

Disclosed are an image processing device and an image processing method which achieve an increase in the speed of image processing by designating and operating a plurality of image processing units, each corresponding to a specific function for the image processing, in accordance with a program. A frame memory (21 . . . ) stores image data to be processed. Parallel memories (121 . . . ) each receive all or part of the image data stored in the frame memory (21 . . . ) and transmit the received image data to any of the DMACs (111 . . . ) or processing units (13A . . . ) for the image processing. The processing units (13A . . . ) each have a function corresponding to a function for the image processing. The processing units (13A . . . ) each receive all or part of the image data from the parallel memory (121 . . . ) or the frame memory (21 . . . ) in accordance with a command from a CPU (3) and perform processing based on the function for the image processing on all or part of the image data.

Description

    TECHNICAL FIELD
  • The present invention relates to a device and method suitable for high speed processing of images.
  • BACKGROUND ART
  • Often, with conventional machine vision and robot vision, a frame rate has been used whose upper limit is the video frame rate (24-60 fps) determined by human visual characteristics. In contrast, research has been conducted into real time vision (hereafter referred to as high-speed vision) using high frame rate cameras on the order of 1000 fps, far in excess of the video frame rate.
  • For example, by using high-speed vision, stabilized visual feedback control becomes possible, and so high-speed vision is applied to control of robots requiring high-speed operation (non-patent publications 1-5 below).
  • Also, since high-speed vision can measure fast movement, it is also applied to somatoscopy (non-patent publications 6 and 7 below), motion capture (non-patent publication 8 below) and fluid measurement (non-patent publication 9 below).
  • Besides this, there is also research using high-speed vision in improving the performance of general image processing, such as tracking (non-patent publications 10 and 11 below), three-dimensional measurement (non-patent publications 12 and 13 below), image composition (non-patent publications 14 and 15 below), optical flow estimation (non-patent publication 16 below) etc.
  • Handling this type of high frame rate moving image in real time requires high calculation performance. In recent years, due to the dramatic improvement in computers, a certain level of performance can be attained even with systems that use PCs, but there is a problem in that PCs lack stability and reliability. Accordingly, in order to realize practicable high-speed vision, it has been considered desirable to use an embedded system. Using an embedded system makes it possible to optimize the hardware structure for the intended use, and also leads to miniaturization of the system.
  • On the other hand, a CPU used in a typical embedded system is underpowered compared to a PC CPU, and so there is a need to accelerate image processing using a coprocessor. High-speed vision systems developed hitherto have attempted to speed up calculation by adopting SIMD type massively parallel processors (non-patent publication 10 below), or by implementing dedicated circuits in an FPGA (field programmable gate array), an LSI whose hardware structure can be rewritten (non-patent publications 17 and 18 below).
  • An SIMD type massively parallel processor can deliver extremely high performance when processing is carried out uniformly on a large number of pixels (non-patent publications 19-22 below). However, when processing only part of an image, it is not always possible to make effective use of all the processing elements (PEs). In many applications that use high-speed vision there is a need to process localized regions, as in tracking, at a higher speed than the entire image, and since calculation on small regions is central, this problem is serious.
  • Also, in many cases data transfer between PEs is only possible between adjacent PEs, and efficient implementation of geometric transforms, such as scaling or rotation, is difficult. Accordingly, there are limitations on the algorithms that can be implemented.
  • Besides this, a focal plane processor, which carries out calculation processing on the image plane of an image sensor, can also be said to be suited to high frame rate processing, but due to constraints on circuit area it is often designed specialized for specific processing. There has also been development of technology to perform general-purpose calculation (non-patent publications 23-26 below), but this suffers from the same problems as the SIMD type massively parallel processor described above.
  • It is also conceivable to use DSPs for image processing. In recent years, DSPs adopting parallel processing techniques such as VLIW (Very Long Instruction Word) or multicore technology have become prominent, enabling high-speed processing (non-patent publications 27 and 28 below). However, in an architecture using VLIW, since parallelization of algorithms is mainly performed automatically by a compiler, the time required to execute instructions cannot be predicted in advance, and execution speed may be lowered for reasons that are not anticipated.
  • In contrast, since ASICs and FPGAs can directly implement in hardware the parallelism possessed by an algorithm, parallelization efficiency is high and it is easy to optimize processing. In particular FPGAs, which are LSIs capable of having their hardware structure rewritten, are suited to prototyping and low volume production. On the other hand, with a system using FPGAs, circuit design using an HDL (hardware description language) is required every time an algorithm is changed, and there is a problem that development costs are high.
  • In contrast, a system that can customize the instruction set of a general purpose CPU using the reconfigurability of an FPGA has also been proposed (non-patent publication 29 below). With this system, the simplicity of software development possessed by a CPU is combined with the reconfigurability of an FPGA, and it is possible to minimize the user's circuit design burden. However, this system can only use CPUs that have been prepared in advance, and it is not possible to make full use of commercial development tools, middleware, software resources, etc.
  • According to the findings of the present inventors, if performance degradation due to interrupts, multitasking, etc. is considered, it is desirable to isolate the CPU and the FPGAs as much as possible, and for the FPGAs to function autonomously. It is also desirable to prepare, in advance, structures for the parallel processing, parallel data access, and high-speed data transfer required for image processing.
  • CITATION LIST
  • Non-Patent Literature
    • Non-patent literature 1: A. Namiki, Y. Nakabo, I. Ishii, and M. Ishikawa, “1 ms sensory-motor fusion system,” IEEE Transactions on Mechatronics, Vol. 5, No. 3, pp. 244-252, 2000.
    • Non-patent literature 2: Y. Nakamura, K. Kishi, and H. Kawakami, “Heartbeat synchronization for robotic cardiac surgery,” Proc. IEEE International Conference on Robotics and Automation, pp. 2014-2019, 2001.
    • Non-patent literature 3: R. Ginhoux, J. Gangloff, M. de Mathelin, L. Soler, M. Sanchez, and J. Marescaux, “Beating heart tracking in robotic surgery using 500 Hz visual servoing, model predictive control and an adaptive observer,” Proc. IEEE International Conference on Robotics and Automation, pp. 274-279, 2004.
    • Non-patent literature 4: T. Senoo, A. Namiki, and M. Ishikawa, “High-speed batting using a multi-jointed manipulator,” Proc. IEEE International Conference on Robotics and Automation, pp. 1191-1196, 2004.
    • Non-patent literature 5: N. Furukawa, A. Namiki, T. Senoo, and M. Ishikawa, “Dynamic regrasping using a high-speed multi-fingered hand and a high-speed vision system,” Proc. IEEE International Conference on Robotics and Automation, pp. 181-187, 2006.
    • Non-patent literature 6: H. Oku, N. Ogawa, K. Hashimoto, and M. Ishikawa, “Two-dimensional tracking of a motile microorganism allowing high-resolution observation with various imaging techniques,” Review of Scientific Instruments, Vol. 76, No. 3, 034301, 2005.
    • Non-patent literature 7: I. Ishii, Y. Nie, K. Yamamoto, K. Orito, and H. Matsuda, “Real-time and long-time quantification of behavior of laboratory mice scratching,” Proc. IEEE International Conference on Automation Science and Engineering, pp. 628-633, 2007.
    • Non-patent literature 8: K. Yamane, T. Kuroda, and Y. Nakamura, “High-precision and high-speed motion capture combining heterogeneous cameras,” Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 279-286, 2004.
    • Non-patent literature 9: Y. Watanabe, T. Komuro, and M. Ishikawa, “A high-speed vision system for moment-based analysis of numerous objects,” Proc. IEEE International Conference on Image Processing, pp. V177-180, 2007.
    • Non-patent literature 10: Y. Nakabo, M. Ishikawa, H. Toyoda, and S. Mizuno, “1 ms column parallel vision system and its application of high speed target tracking,” Proc. IEEE International Conference on Robotics and Automation, pp. 650-655, 2000.
    • Non-patent literature 11: U. Muehlmann, M. Ribo, P. Lang, and A. Pinz, “A new high speed CMOS camera for real-time tracking applications,” Proc. IEEE International Conference on Robotics and Automation, pp. 5195-5200, 2004.
    • Non-patent literature 12: Y. Watanabe, T. Komuro, and M. Ishikawa, “955-fps real-time shape measurement of a moving/deforming object using high-speed vision for numerous-point analysis,” Proc. IEEE International Conference on Robotics and Automation, pp. 3192-3197, 2007.
    • Non-patent literature 13: I. Ishii, K. Yamamoto, K. Doi, and T. Tsuji, “High-speed 3D image acquisition using coded structured light projection,” Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 925-930, 2007.
    • Non-patent literature 14: X. Liu and A. Gamal, “Synthesis of high dynamic range motion blur free image from multiple captures,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 50, No. 4, pp. 530-539, 2003.
    • Non-patent literature 15: T. Komuro, Y. Watanabe, M. Ishikawa, and T. Narabu, “High-S/N imaging of a moving object using a high-frame-rate camera,” Proc. IEEE International Conference on Image Processing, pp. 517-520, 2008.
    • Non-patent literature 16: S. Lim, J. Apostolopoulos, and A. Gamal, “Optical flow estimation using temporally oversampled video,” IEEE Transactions on Image Processing, Vol. 14, No. 8, pp. 1074-1087, 2005.
    • Non-patent literature 17: I. Ishii, K. Kato, S. Kurozumi, H. Nagai, A. Numata, and K. Tajima, “Development of a mega-pixel and milli-second vision system using intelligent pixel selection,” Proc. IEEE Technical Exhibition Based Conference on Robotics and Automation, pp. 9-10, 2004.
    • Non-patent literature 18: K. Shimizu and S. Hirai, “CMOS+FPGA vision system for visual feedback of mechanical systems,” Proc. IEEE International Conference on Robotics and Automation, pp. 2060-2065, 2006.
    • Non-patent literature 19: W. Raab, N. Bruels, U. Hachmann, J. Harnisch, U. Ramacher, C. Sauer, and A. Techmer, “A 100-GOPS programmable processor for vehicle vision systems,” IEEE Design & Test of Computers, Vol. 20, No. 1, pp. 8-15, 2003.
    • Non-patent literature 20: H. Noda, M. Nakajima, K. Dosaka, K. Nakata, M. Higashida, O. Yamamoto, K. Mizumoto, T. Tanizaki, T. Gyohten, Y. Okuno, H. Kondo, Y. Shimazu, K. Arimoto, K. Saito, and T. Shimizu, “The design and implementation of the massively parallel processor based on the matrix architecture,” IEEE Journal of Solid-State Circuits, Vol. 42, No. 1, pp. 183-192, 2007.
    • Non-patent literature 21: A. Abbo, R. Kleihorst, V. Choudhary, L. Sevat, P. Wielage, S. Mouy, B. Vermeulen, and M. Heijligers, “Xetal-II: A 107 GOPS, 600 mW massively parallel processor for video scene analysis,” IEEE Journal of Solid-State Circuits, Vol. 43, No. 1, pp. 192-201, 2008.
    • Non-patent literature 22: S. Kyo, S. Okazaki, and T. Arai, “An integrated memory array processor architecture for embedded image recognition systems,” Proc. International Symposium on Computer Architecture, pp. 134-145, 2005.
    • Non-patent literature 23: J. Eklund, C. Svensson, and A. Åström, “VLSI implementation of a focal plane image processor—a realization of the near-sensor image processing concept,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 4, No. 3, pp. 322-335, 1996.
    • Non-patent literature 24: T. Komuro, S. Kagami, and M. Ishikawa, “A dynamically reconfigurable SIMD processor for a vision chip,” IEEE Journal of Solid-State Circuits, Vol. 39, No. 1, pp. 265-268, 2004.
    • Non-patent literature 25: P. Dudek and P. Hicks, “A general-purpose processor-per-pixel analog SIMD vision chip,” IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 52, No. 1, pp. 13-20, 2005.
    • Non-patent literature 26: W. Miao, Q. Lin, W. Zhang, and N. Wu, “A programmable SIMD vision chip for real-time vision applications,” IEEE Journal of Solid-State Circuits, Vol. 43, pp. 1470-1479, 2008.
    • Non-patent literature 27: J. Tanabe, Y. Taniguchi, T. Miyamori, Y. Miyamoto, H. Takeda, M. Tarui, H. Nakayama, N. Takeda, K. Maeda, and M. Matsui, “Visconti: multi VLIW image recognition processor based on configurable processor,” Proc. IEEE Custom Integrated Circuits Conference, pp. 185-188, 2003.
    • Non-patent literature 28: B. Khailany, T. Williams, J. Lin, E. Long, M. Rygh, D. Tovey, and W. Dally, “A programmable 512 GOPS stream processor for signal, image, and video processing,” IEEE Journal of Solid-State Circuits, Vol. 43, pp. 202-213, 2008.
    • Non-patent literature 29: M. Wirthlin, B. Hutchings, and K. Gilson, “The nanoprocessor: a low resource reconfigurable processor,” Proc. IEEE Workshop on FPGAs for Custom Computing Machines, pp. 23-30, 1994.
    • Non-patent literature 30: J. Farrugia, P. Horain, E. Guehenneux, and Y. Alusse, “GPUCV: a framework for image processing acceleration with graphics processors,” Proc. IEEE International Conference on Multimedia and Expo, pp. 585-588, 2006.
    SUMMARY OF THE INVENTION
  • Technical Problem
  • The present invention has been conceived in view of the above-described situation. The main object of the present invention is to make image processing high-speed by causing designation and operation of a plurality of image processing sections corresponding to a specific function for image processing in accordance with a program.
  • Solution to the Problems
  • Means for solving the above-described problems can be described as in the following aspects.
  • (Aspect 1)
  • An image processing device, comprising:
  • a coprocessor, frame memory and a CPU,
  • the frame memory being configured to store image data that is to be processed,
  • the coprocessor being provided with a plurality of image processing sections and a plurality of parallel memories,
  • the parallel memories being configured to receive all or part of the image data that has been stored in the frame memory and to transmit it to any of the image processing sections,
  • the plurality of image processing sections each having a function corresponding to a function for image processing, and
  • the plurality of image processing sections being configured to, in accordance with instruction from a CPU, receive all or part of the image data from the parallel memories or the frame memory, and perform processing on all or part of the image data in accordance with a function for the image processing.
  • The image processing sections correspond to specific functions used in image processing. In the case of carrying out image processing, processing can be made high speed by carrying out execution of functions required for processing in dedicated image processing sections. Further, in a program, it is possible to execute processing by designating a specific function or image processing section.
  • (Aspect 2)
  • The image processing device of aspect 1, wherein the coprocessor is configured using reconfigurable programmable logic device.
  • The reconfigurable programmable logic devices are integrated circuits normally referred to as FPGAs (field-programmable gate arrays). By using this type of device as a coprocessor it is possible to rewrite the functions of the image processing sections according to the user's needs. For example, it is possible to add image processing sections corresponding to deficient functions, or to add image processing sections corresponding to required functions.
  • (Aspect 3)
  • The image processing device of aspect 1 or 2, wherein the plurality of parallel memories are dual port memories.
  • By using dual port memories it is possible to carry out read and write to the memories independently. It is therefore possible to make processing even higher speed.
  • Also, by using dual port memory it is possible to carry out pipeline processing with parallel memory as a buffer, in accordance with the CPU commands.
  • (Aspect 4)
  • The image processing device of any one of aspects 1-3, wherein the image processing sections comprise a direct memory access controller and a processing unit, the direct memory access controller being configured to control operation of the parallel memory, and the processing unit being configured to carry out processing in accordance with a function for the image processing.
  • (Aspect 5)
  • The image processing device of any one of aspects 1-4, wherein a plurality of the coprocessors are provided.
  • (Aspect 6)
  • The image processing device of aspect 5, wherein the plurality of coprocessors are connected to a shared coprocessor bus.
  • (Aspect 7)
  • The image processing device of any one of aspects 1-6, wherein the coprocessor is further provided with a descriptor, the CPU being configured to write commands for a coprocessor to the descriptor, and the coprocessors being configured to read commands that have been written to the descriptor, and execute processing using the plurality of image processing sections.
  • By using the descriptor, the CPU can designate a plurality of processes for the coprocessor at one time. As a result, there is the advantage that it is possible to reduce the number of times interrupts are issued to the CPU at the time of operation completion by the co-processor.
  • (Aspect 8)
  • The image processing device of any one of aspects 1-7, wherein the plurality of image processing sections are configured to operate independently and in parallel in accordance with commands from the CPU.
  • By enabling parallel operation of the plurality of image processing sections, parallel processing becomes possible at a task level, in accordance with commands from the CPU. Also, by writing process sequences at a processing unit and waiting unit into the descriptor, it is possible to efficiently carry out parallel processing at a task level.
  • (Aspect 9)
  • An image processing method provided with the following steps:
  • (1) a step of a frame memory storing image data that is to be processed;
  • (2) a step of a parallel memory receiving all or part of the image data that has been stored in the frame memory;
  • (3) a step of the plurality of image processing sections receiving all or part of the image data from the parallel memories or the frame memory, in accordance with instruction from a CPU; and
  • (4) a step of, in accordance with instruction from the CPU, respectively performing processing on all or part of the image data in accordance with a function for image processing.
  • (Aspect 10)
  • The image processing method of aspect 9, wherein dual port memory is used as the parallel memory, and further, the plurality of image processing sections perform pipeline processing with the parallel memory as a buffer, in accordance with instruction from the CPU.
  • By carrying out pipeline processing with parallel memory as a buffer, it becomes possible to make image processing even higher speed.
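  • For example, using the proc_X( )-style API described in the embodiment below, a two-stage pipeline through a dual port parallel memory might be written as follows (the unit and memory names here are hypothetical):

    /* Hypothetical prototypes in the proc_X(cp, unit, wunits, ...) format
       of the embodiment; src and dst memories are passed as extra args. */
    void proc_A(int cp, int unit, int wunits, ...);
    void proc_B(int cp, int unit, int wunits, ...);

    enum { CP_1 = 1 };
    enum { UNIT_A1 = 1 << 0, UNIT_B = 1 << 1 };
    enum { FRAMEMEM = 0, INTRAM1 = 1, INTRAM2 = 2 };

    void pipeline_example(void)
    {
        /* Stage 1 streams its results into the dual port memory INTRAM1
           while stage 2, started with no waiting unit, reads INTRAM1
           behind it; valid only if stage 2 cannot overtake stage 1. */
        proc_A(CP_1, UNIT_A1, 0, FRAMEMEM, INTRAM1);
        proc_B(CP_1, UNIT_B, 0, INTRAM1, INTRAM2);
    }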
  • (Aspect 11)
  • The image processing method of aspects 9 or 10, wherein the plurality of image processing sections are configured to operate independently and in parallel in accordance with commands from the CPU, and the plurality of image processing sections also carry out parallel processing at a task level in accordance with instruction from the CPU.
  • By carrying out parallel processing at a task level, it can be expected to make image processing high-speed.
  • Effect of the Invention
  • According to the present invention, it is possible to execute high-speed image processing, and furthermore it becomes possible to provide an image processing device and image processing method where the burden on program creation is not excessive.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic block diagram of an image processing device of one embodiment of the present invention.
  • FIG. 2 is a flowchart showing an overview of an image processing method using the device of FIG. 1.
  • FIG. 3 is a schematic hardware structure diagram of the device of FIG. 1.
  • EMBODIMENTS FOR CARRYING OUT THE INVENTION
  • An image processing device of one embodiment of the present invention will be described with reference to the attached drawings. This image processing device comprises, as main elements, coprocessors 11, 12, . . . 1P of P in number, frame memories 21, 22, . . . 2P of P in number, and a CPU 3. This device is further provided with a main memory 4, an I/O interface 5, a camera interface 6, a video interface 7, a CPU bus 8, and an inter-coprocessor bus 9.
  • Each frame memory 21 . . . is configured to store image data that will be processed. Specifically, with this embodiment, each frame memory is configured to store image data acquired from the camera interface 6 or the video interface 7. As illustrated in the drawings, each frame memory 21 . . . is provided in correspondence with each coprocessor 11 . . . .
  • The coprocessors 11 . . . are each provided with a plurality of direct memory access controllers (DMAC) 111, 112 . . . , 11N, a plurality of parallel memories 121, 122, . . . , 12M, and a plurality of processing units 13A, 13B, . . . , 13X. The specific internal structure of each coprocessor is the same in this embodiment, and so detailed description will only be given for the internal structure of coprocessor 11.
  • With this embodiment, the plurality of image processing sections of the present invention are constituted by the DMACs 111 . . . and the processing units 13A . . . . The DMACs and the processing units are not provided in one-to-one correspondence. In this specification, the fact that there are a plurality of processing units means that there are a plurality of image processing sections. However, in the case where the DMACs 111 . . . handle an image processing function, there are a plurality of DMACs, and it is then also possible to understand that there are a plurality of image processing sections.
  • The DMACs 111 . . . are configured to control operation of the parallel memories 121 . . . . However, with this embodiment, the DMAC 111 cooperates with the processing units 13A . . . so as to execute functions of the image processing.
  • The processing units 13A . . . are configured corresponding to functions for image processing.
  • The parallel memories 121 . . . acquire all or part of image data that has been stored in the frame memory 21, and transmit the data to any of the processing units 13A . . . via the DMACs.
  • Also, dual port memory is used as the parallel memory 121 . . . of this embodiment.
  • The plurality of DMACs 111 . . . and processing unit sections 13A . . . of this embodiment each have a function corresponding to a function for image processing. However, it is also possible to have a structure where only the processing units 13A . . . handle this function.
  • The DMACs 111 . . . and the processing unit sections 13A . . . are configured to acquire all or part of image data from the parallel memories 121 . . . or the frame memory 21, in accordance with commands from the CPU. Further, the DMACs 111 . . . and the processing unit sections 13A . . . carry out image processing in accordance with a function for image processing on all or part of the image data.
  • The coprocessors 11 . . . of this embodiment are configured using reconfigurable programmable logic devices, specifically, so-called FPGAs. Accordingly, the number and capacity of parallel memories 121 . . . of the coprocessors 11 . . . , and the number and functions of the DMACs 111 . . . and the processing units 13A . . . , can be changed by rewriting the coprocessors 11 . . . .
  • The I/O interface 5 is a section for controlling input and output operations between external devices (not illustrated).
  • The camera interface 6 has a function for acquiring images from a camera (not shown).
  • The video interface 7 has a function for acquiring images from a video (not shown).
  • The CPU bus 8 is a bus for carrying out data transfer between the CPU and each of the co-processors 11 . . . .
  • The inter-coprocessor bus 9 is a bus for carrying out data transfer between each of the co-processors 11 . . . .
  • Each of the coprocessors 11 . . . is further provided with a descriptor 141. A descriptor 141 is a register-array for writing contents of image processing and direct memory access (DMA) in accordance with commands from the CPU 3. Specifically, the CPU 3 of this embodiment writes commands for the coprocessors 11 . . . to the descriptor 141.
  • The coprocessors 11 . . . read out commands written in the descriptor 141, and execute processing using the DMACs 111 . . . and the processing units 13A . . . (specifically, processing using the plurality of image processing sections).
  • (Image Processing Method)
  • Next, an image processing method that uses the image processing device of this embodiment will be described below with reference to FIG. 2.
  • (Step S-1 of FIG. 2)
  • First, image data constituting a subject of processing is acquired from the camera interface 6 or the video interface 7 in accordance with commands from the CPU 3.
  • (Step S-2 of FIG. 2)
  • Next, frame memories 21 . . . corresponding to coprocessors 11 . . . that will process the image store the image or part of the image. This processing is also carried out in accordance with commands from the CPU 3.
  • On the other hand, the CPU 3 writes commands for each of the coprocessors 11 . . . to a respective descriptor 141.
  • Specifically, the CPU 3 writes the following information (commands) to each descriptor 141; a sketch of one such descriptor entry is given after the list.
      • Processing unit that will be used;
      • Parallel memory that will be used (for input and for output);
      • Parameters for processing;
      • Which processing unit's completion must be waited for before processing commences (specifically, the waiting unit).
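  • A minimal C sketch of one such descriptor entry (the field names and widths are assumptions for illustration; the actual register layout is not specified here):

    #include <stdint.h>

    /* One descriptor entry, modeling the command fields listed above. */
    typedef struct {
        uint8_t  unit;       /* processing unit to be used                  */
        uint8_t  src_mem;    /* parallel memory (or frame memory) for input */
        uint8_t  dst_mem;    /* parallel memory for output                  */
        uint16_t wunits;     /* waiting units: bit mask of units whose      */
                             /* completion must precede this command        */
        uint16_t params[4];  /* processing parameters (addresses, size)     */
    } descriptor_entry_t;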
    (Step S-3 of FIG. 2)
  • Next, each of the coprocessors 11 . . . reads out commands that have been written into the descriptor 141.
  • Specifically, each of the coprocessors 11 . . . reads out commands written in the descriptor 141, and assigns processing to each image processing section (DMAC and processing unit). Respective DMACs and processing units are operated independently and in parallel. For example, carrying out coordinate change while carrying out summation calculation is also possible.
  • If a descriptor system is used, it is possible to designate a plurality of processes in a coprocessor at one time, which means it is possible to reduce the number of times interrupts are issued to the CPU at the time of operation completion by the coprocessor.
  • (Step S-4 and S-5 of FIG. 2)
  • Next, the image processing sections acquire all or part of an image from the frame memories 21 . . . or from the parallel memory 121 . . . , and perform processing. This processing will be described in detail in the following.
  • A module that sorts the processes written to the descriptor (this module can be constructed within the descriptor, for example) operates as follows. In the description here, “processing unit” also includes a DMAC in the case where it has an image processing function.
      • 1) read next descriptor.
      • 2) if descriptor is empty halt processing.
      • 3) wait until the processing unit to be used and all waiting units become usable.
      • 4) distribute processing to processing unit.
      • 5) return to 1) above.
  • Here, in this embodiment, in the case where no waiting unit is designated, the content of processing that has been written to a descriptor is immediately sent to the respective processing units. In the event that a waiting unit has been designated, the distribution of processing is not carried out until the designated unit is empty (until its processing is complete). In the event that processing using the same unit is queued, the next processing is carried out upon completion of the previous processing.
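  • Expressed as C pseudocode (a behavioral sketch using the descriptor_entry_t sketched earlier and assumed helper functions, not the actual circuit), steps 1)-5) above become:

    /* Assumed helpers: read_next() fetches the next descriptor entry and
       returns 0 when the descriptor is empty; is_busy() tests a unit bit
       mask; dispatch() hands one command to its processing unit. */
    extern int  read_next(descriptor_entry_t *d);
    extern int  is_busy(uint16_t unit_mask);
    extern void dispatch(const descriptor_entry_t *d);

    void sort_descriptors(void)
    {
        descriptor_entry_t d;
        while (read_next(&d)) {                 /* 1) read; 2) halt if empty */
            uint16_t wait_mask = d.wunits | (uint16_t)(1u << d.unit);
            while (is_busy(wait_mask))          /* 3) wait for the target    */
                ;                               /*    and all waiting units  */
            dispatch(&d);                       /* 4) distribute processing  */
        }                                       /* 5) return to 1)           */
    }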
  • With the architecture of this embodiment, implementation of algorithms can be carried out using only a normal CPU programming environment. Processing units 13A . . . or DMACs 111 . . . for carrying out basic image processing are prepared in advance within the coprocessors 11 . . . , and by using these in combination it is possible to implement various algorithms. With this embodiment, since it is not necessary to carry out circuit design every time it is intended to change an algorithm, the burden on the user is reduced. Also, since with this embodiment it is possible to execute specific functions at high speed using the processing units 13A . . . or the DMACs 111 . . . , it is possible to carry out image processing at high speed.
  • With this embodiment, it is possible to write commands to the descriptor by calling an API (application programming interface) prepared within the program.
  • For example, the case of carrying out processing in a processing unit UNIT_A2 upon completion of processing in both a processing unit UNIT_A1 and a processing unit UNIT_B of coprocessor 11 is written as follows. Here, suffixed names such as UNIT_A1 and UNIT_A2 represent processing modules having the same function.
  • proc_A(CP_1, UNIT_A1, 0, ...);
  • proc_B(CP_1, UNIT_B, 0, ...);
  • proc_A(CP_1, UNIT_A2, UNIT_A1 | UNIT_B, ...);
  • Here, the function proc_X takes the following format. The function name represents the type of processing; cp represents the coprocessor used, unit represents the processing unit used, and wunits represents the waiting units. Besides these, the memory and addresses to be used, the image size, calculation parameters, etc. are also designated by arguments.
  • proc_X(int cp, int unit, int wunits, ...);
  • If sync( ) is called, the firmware enters a wait state until the designated processing units are empty. The coprocessor cp and the waiting units wunits are given as arguments to sync( ).
  • proc_A(CP1, UNIT_A1, ...);
  • proc_B(CP1, UNIT_B, ...);
  • sync(CP1, UNIT_A1 | UNIT_B);
  • In a case where the previously prepared functions alone are insufficient, as was described above, it is possible to prepare the necessary functions by rewriting logic circuits within the FPGAs constituting the coprocessors 11 . . . . At this time circuit design becomes necessary, but not for all coprocessors; since changes can be made on a unit by unit basis, the circuit design burden can be kept as low as possible.
  • Further, since the basic structure of the processor is not changed even if the FPGA is rewritten, it is possible to maintain software compatibility as much as possible. For example, by calling units of the same function by the same function name and/or unit name, it is possible to keep changes to the existing code as small as possible.
  • The processing units 13A . . . are implemented as dedicated circuits for each process, and parallelization is achieved by concurrent execution of operations and by pipelining. In addition, by simultaneously reading out a plurality of items of data from the parallel memories 121 . . . , it is possible to execute processing for one or more pixels in one clock. This can be regarded as parallel processing within a task.
  • Specifically, in this embodiment, it is possible to carry out parallel processing at the task level in accordance with instructions from the CPU (namely a program), using a plurality of image processing sections that operate in parallel.
  • On the other hand, for parallelizing processing at the task level, there are the following methods:
      • 1) using a plurality of units concurrently within the same coprocessor;
      • 2) using a plurality of coprocessors concurrently.
  • In the case of using a plurality of units concurrently within the same coprocessor, it is not possible for the plurality of units to carry out simultaneous input or simultaneous output to/from the same memory (doing so will result in a run-time error). This is because the internal memory of the FPGA does not have an arbitration function; if an arbitration function were implemented in the memory, concurrent access would become possible. For example, while summation calculation is being performed with INTRAM1, which is a parallel memory, as an input, it is not possible to also use INTRAM1 as an input for scale conversion, but it is possible to make INTRAM2, which is another parallel memory, the input.
  • Also, in the case where there are two processing units for summation calculation, if INTRAM1 is being made an input at one of the processing units, INTRAM1 cannot also be made an input to the other processing unit.
  • On the other hand, with respect to a memory which is external to the FPGA, such as a PBSRAM (described later) or frame memory, since an arbitration function is generally implemented, concurrent access is possible.
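  • The following hypothetical call sequence, written in the proc_X format described above (with arguments elided, and unit names used only for illustration), sketches this rule:

    proc_sum(CP1, SUM1, 0, INTRAM1, ...);      /* OK: summation reads parallel memory INTRAM1   */
    proc_scale(CP1, SCALE1, 0, INTRAM2, ...);  /* OK: scale conversion reads a different memory */
    /* proc_scale(CP1, SCALE1, 0, INTRAM1, ...);
       NG: INTRAM1 is already being read; internal FPGA memory has no
       arbitration, so simultaneous access causes a run-time error.
       Frame memory or PBSRAM, being externally arbitrated, could be
       accessed concurrently instead. */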
  • When implementing an algorithm, there are the following methods for dividing processing for parallelization (a sketch of the first method is given after this list):
      • 1) dividing a single image and executing the parts respectively on separate units;
      • 2) executing respectively on separate units when the same processing is executed a plurality of times, such as in template matching;
      • 3) pipelining processing at the task level; etc.
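  • As a minimal sketch of the first method, under the same illustrative assumptions as above, one image could be divided into two halves processed concurrently on two units of the same function:

    /* upper half on SCALE1, lower half on SCALE2 (addresses illustrative) */
    proc_scale(CP1, SCALE1, 0, src_top, dst_top, ...);
    proc_scale(CP1, SCALE2, 0, src_bottom, dst_bottom, ...);
    sync(CP1, SCALE1 | SCALE2);  /* wait for both halves to complete */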
  • Generally, in the case where the same processing is distributed to a plurality of units, it is preferable from the viewpoint of resource contention to assign the processing to separate coprocessors. On the other hand, when there is a strong dependency between tasks, if transfer cost is taken into consideration it is preferable to assign them within the same coprocessor.
  • Here, in the case where the parallel memories 121 . . . are constituted by dual port RAM, as in this embodiment, reading and writing can be carried out independently. As a result, even in the case where the output destination of one process is the input source of the next process, if it can be confirmed that the subsequent process does not overtake the preceding process, it is possible to start the next process before all of the preceding processing is complete.
  • In this way, compared to the case where the next process cannot be executed until the previous process is complete and all results are written to memory, it is possible to construct a pipeline with shorter stages, which contributes to high-speed processing. Specifically, with this embodiment, by using dual port memory as the parallel memory, it is possible to carry out pipeline processing with the memory as a buffer in accordance with CPU instructions (namely, a program), and it becomes possible to make processing high speed.
  • A specific example of pipelined processing will be described later (table 4).
  • PRACTICAL EXAMPLE
  • Based on the above described architecture, the present inventors developed the evaluation system shown below. Results for system design and performance evaluation are shown.
  • A block diagram of the developed system is shown in FIG. 3. This system is implemented on a single mainboard on which two sub boards are mounted.
  • FPGAs, memory, an I/O port etc. are mounted on the mainboard, and the sub boards are a CPU board and a camera interface board. Correspondence between elements of the previously described embodiment and the hardware of this practical example is shown below.
      • FPGA: coprocessor
      • DDR2DIMM: Frame memory
      • DIO, USB, RS-232C, Ethernet (Registered trademark): I/O interface
      • DDR-SDRAM: Main memory
      • EXT.BUS: Expansion bus.
  • Each FPGA is connected to the CPU bus by means of this expansion bus. Accordingly, functionally this expansion bus doubles as both an inter-coprocessor bus and a CPU bus.
  • The PBSRAM in FIG. 3 is not shown in FIG. 1, but is external memory for each FPGA.
  • With this practical example, in order to limit development costs, the CPU board uses a commercially available CPU board, the ESPT-Giga (trade name), which is connected to the FPGAs on the mainboard through the expansion bus. The ESPT-Giga has a Renesas SH7763 (SH-4A, 266 MHz) as a CPU and 64 MB of DDR-SDRAM as memory, and is provided with 10/100/1000BASE Ethernet (registered trademark), USB1.1, and RS-232C for input/output.
  • Here, the ESPT-Giga can have a built-in Web server function, making it possible to operate the system from a PC through a web browser and to display processing results. In this way, remote management over a LAN becomes possible, and a plurality of systems can be managed from a single PC.
  • With this practical example, two Altera EP2S60F1020C5 devices were mounted as FPGAs. Flash memories (8 Mbytes) for storing configuration data of the FPGAs are also provided, one for each FPGA.
  • The FPGAs have respective frame memories (DRAM), and an input image from a camera is automatically stored in the frame memory of FPGA1. A camera interface is connected to FPGA1; the physical interface for an individual camera is implemented on a camera interface board attached to the mainboard. With the implementation of this practical example, the camera interface supports the Basler A504k (monochrome)/A504kc (color) and the Mikrotron Eosens MC1362 (monochrome)/MC1363 (color). These cameras are capable of real-time output of images of a maximum of 1280×1024 pixels at 500 fps. Also, with these cameras, it is possible to raise the frame rate by reducing the number of lines of an image; for example, with 1280×512 pixels it is possible to output at 1000 fps.
  • The previously described A504k/kc and MC1362/1363 adopt an interface in which the CameraLink standard has been specially extended, and are connected to the board with two CameraLink cables. These cameras are compatible at the physical layer with cameras having a normal CameraLink interface, and it is therefore also possible to handle other cameras by changing the circuitry of FPGA1. Further, with this practical example it is also possible to handle other camera interfaces, such as IEEE1394 or Gigabit Ethernet (registered trademark), by changing the camera interface board.
  • Also, an analog VGA port is connected to FPGA1, making it possible to output images that have been stored in the frame memory to a display at SXGA size (1280×1024).
  • With the image processing device of this practical example, each FPGA has a small capacity SRAM that is separate from the frame memory. With this practical example, this SRAM can be used as an input source for coordinate transform processing which will be described later.
  • Further, with this practical example, 32-bit digital I/O (DIO) is provided separately from the external input/output functions of the ESPT-Giga, and can be used in applications that require high real-time capability, such as robot control.
  • Operational speed between each module of this practical example is as follows.
    • between FPGA-DRAM: 3200 MB/s (200 MHz, 128 bit);
    • between FPGA-SRAM: 1333 MB/s (166.6 MHz, 64 bit);
    • between FPGA-CPU: 133 MB/s (66.67 MHz, 16 bit);
    • between FPGA-FPGA: 2133 MB/s (133.33 MHz, 128 bit).
  • Specifications for the boards of this practical example described above are collectively shown in table 1 below.
  • TABLE 1
    CPU               Renesas SH7763 (SH-4A, 266 MHz)
    OS                Renesas HI7750/4 (μITRON 4.0)
    FPGA              Altera Stratix II EP2S60F1020C5 × 2
    Memory            DDR SDRAM 64 MB (CPU)
                      DDR2 DIMM 2 GB (frame)
                      DDR2 SDRAM 128 MB (frame)
                      PBSRAM 8 MB × 2 (FPGA local)
    Supported Camera  Basler A504k/kc
                      1280 × 1024 pixels @ 500 fps
                      1280 × 512 pixels @ 1000 fps
    Video Output      SXGA (1280 × 1024)
    I/Os              10/100/1000BASE Ethernet,
                      USB1.1, 32-bit DIO, RS-232C
    Dimensions        257 mm × 182 mm × 58 mm
    Power Supply      12 V, 5 A
  • With this practical example, a coprocessor as described for the previous embodiment is implemented in each of the FPGAs. Each coprocessor has the following units.
      • 2 parallel memories (64 kbytes) (PAR1, PAR4);
      • 4 parallel memories (16 kbytes) (PAR2, PAR3, PAR5, PAR6);
      • 2 DMA control units (DMA1, DMA2);
      • 9 processing units (refer to table 2 and table 3, described later);
      • Command descriptor.
  • With the coprocessors of this practical example, one pixel is processed as 16 bits. Images sent from a camera are most often 8-bit, but 16 bits is made the standard since greater precision is needed during calculation. For example, the result of adding or subtracting two 8-bit images requires 9 bits (e.g. 255 + 255 = 510). In the case of summing many images with weighting, such as in filter processing, an even greater number of bits is needed.
  • Processing for color images is handled as three independent grayscale images, one for each of R, G and B. At the time of carrying out a calculation on an image at a processing unit, the handling of 16-bit input data and of 16-bit output data can each be designated as follows.
    • Input data is either:
    • 1) interpreted as 0x0000 to 0xffff (unsigned); or
    • 2) interpreted as −0x8000 to 0x7fff (signed).
    • Calculation results are either:
    • 1) represented by 0x0000 to 0xffff;
    • 2) represented by −0x8000 to 0x7fff; or
    • 3) made into an absolute value and represented by 0x0000 to 0xffff.
  • If a calculation result does not fit into 16 bits, it is saturated to the maximum or minimum value.
  • In the event that binarization is carried out, if a pixel value y satisfies the relationship Tl ≤ y ≤ Th, the result is made 1, and otherwise it is made 0. Here, Tl and Th represent the lower and upper limits of an appropriately set threshold range.
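  • As a software reference for these rules (an illustration only, not the hardware implementation), the saturation and binarization behavior can be written in C as:

    #include <stdint.h>

    /* Saturate a wider intermediate result to the unsigned 16-bit range. */
    static uint16_t sat_u16(int32_t v) {
        if (v < 0x0000) return 0x0000;
        if (v > 0xffff) return 0xffff;
        return (uint16_t)v;
    }

    /* Saturate to the signed 16-bit range. */
    static int16_t sat_s16(int32_t v) {
        if (v < -0x8000) return -0x8000;
        if (v >  0x7fff) return  0x7fff;
        return (int16_t)v;
    }

    /* Binarization: 1 if Tl <= y <= Th, otherwise 0. */
    static uint16_t binarize(uint16_t y, uint16_t Tl, uint16_t Th) {
        return (Tl <= y && y <= Th) ? 1 : 0;
    }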
  • Coefficient parameters for image processing are signed fixed-point values of 16-bit or 12-bit length, and the position of the decimal point is designated in common for each parameter.
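  • For illustration, if the decimal point were designated at 8 fractional bits (an assumption for this example), a raw coefficient would correspond to the value raw/2^8:

    #include <stdint.h>

    /* Value of a signed fixed-point coefficient with frac_bits fractional bits. */
    static double coeff_value(int16_t raw, int frac_bits) {
        return (double)raw / (double)(1 << frac_bits);
    }
    /* coeff_value(0x0180, 8) == 1.5;  coeff_value(-0x0080, 8) == -0.5 */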
  • Parallel memory can simultaneously read and write data of 128 bits (8 pixels) at a time. Also, this parallel memory is constituted by dual port RAM, and it is possible to carry out reading and writing independently.
  • The DMA control units (DMACs) carry out transfer of data between the memories. With this practical example, transfer of data to and from the CPU is only possible with one specific DMAC in each FPGA (for example, DMA2). Likewise, the only unit able to transfer data to the other FPGA is another specific DMAC in each FPGA (for example, DMA1). Data transfer between the memories is carried out in 128-bit units, but when transferring data to or from an external memory, the transfer is limited by the operating speed of the external memory.
  • As a data transfer range, (number of transferred bytes per line) × (number of lines) is set, and a transfer start address and an address increment per line are designated for each of the source and destination. In this way, clipping a part out of a large image, or embedding into a part of one, is possible.
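  • A minimal C sketch of such a transfer range (the struct and field names are assumptions for illustration):

    #include <stdint.h>

    /* 2-D DMA transfer range: (bytes per line) x (lines), with a start
       address and a per-line address increment for source and destination. */
    typedef struct {
        uint32_t src_addr;        /* transfer start address (source)      */
        int32_t  src_stride;      /* address increment per line (source)  */
        uint32_t dst_addr;        /* transfer start address (destination) */
        int32_t  dst_stride;      /* address increment per line (dest.)   */
        uint32_t bytes_per_line;  /* number of transferred bytes per line */
        uint32_t num_lines;       /* number of lines                      */
    } dma_xfer_t;

    /* Example: clipping a 64 x 32-pixel region of 16-bit pixels out of a
       1280-pixel-wide frame: bytes_per_line = 64*2, num_lines = 32,
       src_stride = 1280*2, dst_stride = 64*2.                             */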
  • In the DMA control unit (DMAC) of this practical example, it is possible to optionally provide the following data operation circuits (namely image processing functions).
  • Shift Circuit
  • This circuit outputs, every 16 bytes, the result of shifting the data to the left in byte units. The source (src) address for data is restricted to being a multiple of 16, but by using the shift circuit it is possible to make data at an arbitrary address the src.
  • Thinning Circuit
  • This circuit receives data input every 16 bytes and outputs data that has been thinned by either 8→1 (output is ⅛ of the input data amount), 4→1 (¼), or 2→1 (½). It is possible to carry out image reduction using this function together with designation of the DMA transfer address increment.
  • Conversion Circuit
  • With this circuit, data is input every 16 bytes. Conversion can then be carried out as any of: 8-bit (monochrome)→16-bit (double the data amount), 8-bit (Bayer)→16-bit (any one of the R, G or B components), or 16-bit→8-bit (half the data amount). Data sent from a camera has a single pixel constituted by 8 bits, and in order to process this in a coprocessor it is necessary to convert it to one pixel of 16 bits. In the case of a color camera, Bayer conversion is carried out and processing is carried out to output only one component from among R, G and B.
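  • As a software reference model only (the hardware operates on 16-byte blocks; these loops simply illustrate the data transformations):

    #include <stdint.h>

    /* Thinning 2->1: keep every second pixel, halving the data amount. */
    static void thin_2to1(const uint16_t *in, uint16_t *out, int n_in) {
        for (int i = 0; i < n_in / 2; i++)
            out[i] = in[2 * i];
    }

    /* Conversion 8-bit (monochrome) -> 16-bit: widen each pixel. */
    static void conv_8to16(const uint8_t *in, uint16_t *out, int n) {
        for (int i = 0; i < n; i++)
            out[i] = (uint16_t)in[i];
    }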
  • Table 2 shows calculation units (that is, processing units) implemented in each coprocessor.
  • TABLE 2
    name      function                                         pixels per clock   number of units
    SCALE     O(x, y) = a1·I(x, y) + a2                        8                  2
    ARITH     O(x, y) = a1·I1(x, y)·I2(x, y) + a2·I1(x, y)     8                  2
              + a3·I2(x, y) + a4
    3×3CONV   O(x, y) = F * I(x, y),                           4                  1
              F = (a1 a2 a3; a4 a5 a6; a7 a8 a9)
    SUM       S = Σx Σy I(x, y)                                8                  2
    AFFINE    O(x, y) = I(a1x + a2y + a3, a4x + a5y + a6)      1                  2
  • The meanings of the symbols in this table are as follows:
    • SCALE: processing to linearly scale a pixel value;
    • ARITH: processing to carry out addition/subtraction and/or multiplication between 2 images;
    • 3×3CONV: convolution filter with an operator kernel size of 3×3;
    • SUM: processing to calculate a sum of pixel values in a designated range;
    • AFFINE: processing to convert an image with a planar affine transformation.
  • Here, AFFINE can receive input from the SRAM. SCALE, ARITH and SUM can perform processing simultaneously for 8 pixels in 1 clock, while 3×3CONV can carry out processing simultaneously for 4 pixels in 1 clock. AFFINE carries out processing for only 1 pixel in 1 clock.
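  • As a per-pixel software reference for two of the functions in table 2 (an illustration only; the hardware additionally applies the fixed-point coefficient handling and saturation described above):

    #include <stdint.h>

    /* SCALE: O(x, y) = a1*I(x, y) + a2 */
    static int32_t scale_px(int32_t I, int32_t a1, int32_t a2) {
        return a1 * I + a2;
    }

    /* ARITH: O(x, y) = a1*I1*I2 + a2*I1 + a3*I2 + a4 */
    static int32_t arith_px(int32_t I1, int32_t I2,
                            int32_t a1, int32_t a2, int32_t a3, int32_t a4) {
        return a1 * I1 * I2 + a2 * I1 + a3 * I2 + a4;
    }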
  • With this practical example, the processing units are equipped with basic functions that are considered the minimum requirement for executing commonly used image processing algorithms. Besides these, "nonlinear scale transformation using a lookup table (LUT)", "nonlinear coordinate transformation using a lookup table (LUT)", "logical operations", etc. may be included as functions that are useful to implement in the processing units.
  • The descriptors of this practical example can each store up to 256 words of instructions (1 word = 2 bytes). One instruction is made up of 1 to 3 words, depending on the number of parameters. Here, a single instruction corresponds to a single proc_X() function described previously, and can instruct processing of image data in a designated range using a single DMA control unit or image processing unit. An instruction that does nothing at all is also provided; this corresponds to the sync() function.
  • (Specifications such as Operating Frequency, Circuit Size etc.)
  • The FPGAs used in this practical example operate at 200 MHz, and used resources are 88% for FPGA1 and 81% for FPGA2.
  • Table 3 shows the computing time for processing that uses the basic functions of the system of this practical example, and for processing that combines basic functions. For purposes of comparison, the computing time for the case where the same processing is implemented on a PC using OpenCV is also shown. The PC used had an Intel E6300 (1.86 GHz × 2) CPU and 3 GB of RAM; the implementation used Visual Studio 2005 and OpenCV 1.0, and measurement was carried out on Windows (registered trademark) XP. EvalSys in the table shows processing time for the case of using the developed evaluation system, and OpenCV shows processing time for the case of using the PC and OpenCV.
  • TABLE 3
                                  time (ns/pixel)
    algorithm                     EvalSys (TA)   OpenCV (TB)   ratio (TB/TA)
    Basic Image Processing
    Copy†                         1.10           0.51          0.46
    Bayer Conversion†             2.68           5.15          1.92
    Shrink†                       2.16           1.87          0.87
    Shift                         0.76           0.33          0.43
    Scaling                       0.64           6.14          9.59
    Arithmetic Operation          0.76           13.6          17.9
    3 × 3 Convolution             1.41           18.7          13.2
    Summation                     0.90           0.58          0.64
    Affine Transformation         5.16           13.6          2.64
    Complex Image Processing
    Centroid Detection            2.34           10.2          4.36
    † From frame memory
  • In this practical example, input source and output destination are set to parallel memory within the FPGA, and image size is set to 256×32.
  • On the other hand, for measurement with the PC, the same processing was repeatedly carried out 100,000 times and the average execution time was used. The implementation used the OpenCV library, and images were set to signed 16-bit. However, the OpenCV functions for Bayer conversion and affine transformation do not support signed 16-bit, and so these were implemented as unsigned 8-bit. Coefficients used float or double in accordance with the function specification. The three functions Copy, Bayer Conversion and Shrink are assumed to be carried out first on an image that has been acquired from a camera, so their input source is set to the frame memory, and measurement on the PC excludes the influence of caching immediately before processing. Also, Bayer Conversion and Shrink have different image sizes between input and output, but the computation was performed on the basis of output image size.
  • Centroid computation first extracts, from the input image, a region in which the subject is predicted to exist based on the results of the previous frame, and binarizes it with a fixed or adaptively determined threshold value. Next, computation of the centroid is carried out using the following equations.
  • m00 = Σx,y I(x, y)  (1)
  • m10 = Σx,y x·I(x, y),  m01 = Σx,y y·I(x, y)  (2)
  • xc = m10/m00,  yc = m01/m00  (3)
  • Weight images Ix and Iy, holding the x and y coordinate values at each position (x, y), are loaded into parallel memory in advance; the binarized input image is weighted using ARITH, and the moments m10 and m01 are computed by summation calculation using SUM. The moment m00 is obtained by summation calculation without weighting.
  • These processes are pipelined at the task level in the form shown in table 4, and executed in parallel. Since the parallel memory is of the dual-port type, in the case where the output destination of a process becomes the input source of the next process, it is possible to start the next process before completion of all of the previous processing. All of the previously described processing of this practical example was executed using only coprocessor 1.
  • TABLE 4
              processing 1    processing 2    processing 3     processing 4
    step 1    Binarization    Copy            Weighting (x)    Sum (m10)
      UNIT    SCALE1          SCALE2          ARITH1           SUM1
      IN1     PAR1            PAR2            PAR3             PAR5
      IN2                                     PAR4 (Ix)
      OUT     PAR2            PAR3            PAR5
    step 2    Weighting (y)   Sum (m01)       Sum (m00)
      UNIT    ARITH1          SUM2            SUM1
      IN1     PAR2            PAR5            PAR3
      IN2     PAR4 (Iy)
      OUT     PAR5
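  • Expressed against the proc_X()/sync() API described above, the sequence of table 4 might be issued as follows. This is a hedged sketch: the function names, argument order and waiting-unit designations are assumptions; only the unit and memory assignments are taken from table 4.

    /* step 1: issued together; the dual-port parallel memories let each
       process start before its predecessor has written all of its output. */
    proc_scale(CP1, SCALE1, 0, PAR1, PAR2, ...);        /* binarization        */
    proc_scale(CP1, SCALE2, 0, PAR2, PAR3, ...);        /* copy                */
    proc_arith(CP1, ARITH1, 0, PAR3, PAR4, PAR5, ...);  /* weighting (x) by Ix */
    proc_sum(CP1, SUM1, 0, PAR5, ...);                  /* sum -> m10          */

    /* step 2: ARITH1 and SUM1 are reused; reuse of the same unit is queued
       behind its previous processing, and explicit waits are illustrative. */
    proc_arith(CP1, ARITH1, SUM1, PAR2, PAR4, PAR5, ...);  /* weighting (y) by Iy */
    proc_sum(CP1, SUM2, SUM1, PAR5, ...);                  /* sum -> m01          */
    proc_sum(CP1, SUM1, 0, PAR3, ...);                     /* sum -> m00          */

    sync(CP1, ARITH1 | SUM1 | SUM2);  /* then xc = m10/m00, yc = m01/m00 */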
  • It will be understood from these results that, according to the present invention, performance considerably improved over a PC is uniformly obtained. However, the purpose of the comparison with a PC here is to provide a benchmark against other general methods; the reason a PC-based system is not used is to ensure stability and reliability. In practice, with a system that uses a PC, delay and frame dropping are observed frequently, which is a hindrance to high-speed vision applications.
  • Even in the case where a PC is used, it is possible to increase speed using multimedia instructions or a GPU, and in that case it is possible to realize higher computational performance than the developed system. For example, with GPUCV (non-patent publication 30), which is one framework that uses a GPU, processing performance of 1.7 to 18 times that of a PC is reported for some image processing. However, these approaches assume the PC platform, and can be expected to face the above-described problems.
  • The developed board operates with a 12 V, 5 A power supply, and effective power consumption is about 42 W. Because the power consumption of an FPGA is comparatively high, power consumption is high compared to the case of using a DSP or the like, but with this practical example there is the advantage that, as an embedded system, it is possible to ensure stability and reliability.
  • With this practical example, as an architecture for a high-speed vision system, there is the advantage that by combining an embedded microprocessor and FPGAs it is possible to obtain both hardware reconfigurability and ease of algorithm implementation.
  • With the exponential advancement of semiconductor integration, processing architecture that makes practical use of parallel processing will become much more important in the future. At that time it will be difficult to make optimum use of the degree of parallelism with a generic architecture. By combining reconfigurable dedicated circuits using FPGAs and the simple programming environment of a CPU, it is possible to exhibit high image processing performance while having a certain degree of general versatility.
  • The reconfigurable programmable logic devices are integrated circuits normally referred to as FPGAs. By using an FPGA it is possible to rewrite functions of the image processing sections according to the user's needs. For example, it is possible to add image processing sections corresponding to deficient functions, or to add image processing sections corresponding to required functions.
  • With this practical example, by using dual port memory as parallel memory, it is possible to carry out reading from and writing to memory independently. It is therefore possible to make processing even higher speed.
  • Also, by using the descriptor, the CPU can designate a plurality of processes for the coprocessor at one time. As a result, there is the advantage that it is possible to reduce the number of times interrupts are issued to the CPU at the time of operation completion by the coprocessors.
  • The image processing sections correspond to specific functions used in image processing. In the case of carrying out image processing, processing can be made high speed by carrying out execution of functions required for processing in dedicated image processing sections. Further, in a program, it is possible to execute processing by designating specific functions or image processing sections.
  • Thus, while the invention has been described with reference to specific embodiments, the description is illustrative of the invention and is not to be construed as limiting the invention.

Claims (11)

1. An image processing device, comprising a coprocessor, frame memory and a CPU,
the frame memory being configured to store image data that is to be processed,
the coprocessor comprising a plurality of image processing sections and a plurality of parallel memories,
the parallel memories being configured to receive all or part of the image data that has been stored in the frame memory and to transmit it to any of the image processing sections,
the plurality of image processing sections each having a function corresponding to a function for image processing, and
the plurality of image processing sections being configured to, in accordance with instruction from the CPU, receive all or part of the image data from the parallel memories or the frame memory, and perform processing on all or part of the image data in accordance with a function for the image processing.
2. The image processing device of claim 1, wherein the coprocessors are configured using reconfigurable programmable logic devices.
3. The image processing device of claim 1, wherein the plurality of parallel memories are dual port memories.
4. The image processing device of claim 1, wherein the image processing sections comprise direct memory access controllers and processing units, the direct memory access controllers being configured to control operation of the parallel memory, and the processing units being configured to carry out processing in accordance with a function for the image processing.
5. The image processing device of claim 1, wherein a plurality of the coprocessors are provided.
6. The image processing device of claim 5, wherein the plurality of coprocessors are connected to a shared coprocessor bus.
7. The image processing device of claim 1, wherein the coprocessor is further provided with a descriptor, the CPU being configured to write commands for a coprocessor to the descriptor, and the coprocessor being configured to read commands that have been written to the descriptor, and execute processing using the plurality of image processing sections.
8. The image processing device of claim 1, wherein the plurality of image processing sections are configured to operate independently and in parallel in accordance with commands from the CPU.
9. An image processing method provided with the following steps:
(1) a step of a frame memory storing image data that is to be processed;
(2) a step of a parallel memory receiving all or part of the image data that has been stored in the frame memory;
(3) a step of a plurality of image processing sections receiving all or part of the image data from the parallel memories or the frame memory, in accordance with instruction from a CPU; and
(4) a step of, in accordance with instruction from the CPU, respectively performing processing on all or part of the image data in accordance with a function for image processing.
10. The image processing method of claim 9, wherein dual port memory is used as the parallel memory, and further, the plurality of image processing sections perform pipeline processing with the parallel memory as a buffer, in accordance with instruction from the CPU.
11. The image processing method of claim 9, wherein the plurality of image processing sections are configured to operate independently and in parallel in accordance with commands from the CPU, and the plurality of image processing sections also carry out parallel processing at a task level in accordance with instruction from the CPU.
US13/392,510 2009-08-26 2010-08-13 Image processing device and image processing method Abandoned US20120147016A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2009195777A JP2011048579A (en) 2009-08-26 2009-08-26 Image processor and image processing method
JP2009-195777 2009-08-26
PCT/JP2010/063740 WO2011024654A1 (en) 2009-08-26 2010-08-13 Image processing device and image processing method

Publications (1)

Publication Number Publication Date
US20120147016A1 true US20120147016A1 (en) 2012-06-14

Family ID: 43627760

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/392,510 Abandoned US20120147016A1 (en) 2009-08-26 2010-08-13 Image processing device and image processing method

Country Status (5)

Country Link
US (1) US20120147016A1 (en)
EP (1) EP2472468A4 (en)
JP (1) JP2011048579A (en)
CN (1) CN102483842A (en)
WO (1) WO2011024654A1 (en)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8803897B2 (en) * 2009-09-03 2014-08-12 Advanced Micro Devices, Inc. Internal, processing-unit memory for general-purpose use
JP5653865B2 (en) * 2011-08-23 2015-01-14 日本電信電話株式会社 Data processing system
CN102446342B (en) * 2011-08-30 2013-04-17 西安交通大学 Reconfigurable binary arithmetical unit, reconfigurable binary image processing system and basic morphological algorithm implementation method thereof
JP5876319B2 (en) * 2012-02-21 2016-03-02 日本電信電話株式会社 Service providing system, service providing method, resource manager, program
CN103018515B (en) * 2012-12-12 2014-09-24 电子科技大学 Digital oscilloscope with seamless measuring capability
CN104200467B (en) * 2014-08-25 2017-02-15 西安交通大学 Reconfigurable gray-scale morphological image processor as well as gray-scale operation circuit and morphological operation realizing method thereof
EP3400864A4 (en) * 2016-01-04 2019-08-28 Shenzhen Mindray Bio-Medical Electronics Co., Ltd System and method for controlling coordination between medical devices, medical workstation and communication device
WO2017163591A1 (en) * 2016-03-24 2017-09-28 富士フイルム株式会社 Image processing device, image processing method, and image processing program
CN107766021B (en) * 2017-09-27 2020-12-25 芯启源(上海)半导体科技有限公司 Image processing method, image processing apparatus, display system, and storage medium
CN107992100B (en) * 2017-12-13 2021-01-15 中国科学院长春光学精密机械与物理研究所 High frame rate image tracking method and system based on programmable logic array

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4797809A (en) * 1983-09-07 1989-01-10 Ricoh Company, Ltd. Direct memory access device for multidimensional data transfers
US5121480A (en) * 1988-07-18 1992-06-09 Western Digital Corporation Data recording system buffer management and multiple host interface control
US5392391A (en) * 1991-10-18 1995-02-21 Lsi Logic Corporation High performance graphics applications controller
US5584032A (en) * 1984-10-17 1996-12-10 Hyatt; Gilbert P. Kernel processor system
US5835101A (en) * 1996-04-10 1998-11-10 Fujitsu Limited Image information processing apparatus having means for uniting virtual space and real space
US5923893A (en) * 1997-09-05 1999-07-13 Motorola, Inc. Method and apparatus for interfacing a processor to a coprocessor
US6061749A (en) * 1997-04-30 2000-05-09 Canon Kabushiki Kaisha Transformation of a first dataword received from a FIFO into an input register and subsequent dataword from the FIFO into a normalized output dataword
US20020054229A1 (en) * 2000-11-06 2002-05-09 Mega Chips Corporation Image processing circuit
US20030113031A1 (en) * 1997-04-15 2003-06-19 Wal Gooitzen Siemen Van Der Parallel pipeline image processing system
US20030135535A1 (en) * 2002-01-11 2003-07-17 Hoeflinger Jay P. Transferring data between threads in a multiprocessing computer system
US20030142873A1 (en) * 2001-09-21 2003-07-31 Gauthier Lafruit 2D FIFO device and method for use in block based coding applications
US6657621B2 (en) * 2001-05-01 2003-12-02 Hewlett-Packard Development Company, L.P. Device and method for scrolling stored images across a display
US20070030276A1 (en) * 1998-11-09 2007-02-08 Macinnis Alexander G Video and graphics system with parallel processing of graphics windows
US20070091097A1 (en) * 2005-10-18 2007-04-26 Via Technologies, Inc. Method and system for synchronizing parallel engines in a graphics processing unit
US20080055326A1 (en) * 2006-09-05 2008-03-06 Yun Du Processing of Command Sub-Lists by Multiple Graphics Processing Units
US20080303838A1 (en) * 2007-06-07 2008-12-11 Yamaha Corporation Image processing apparatus
US20090010363A1 (en) * 2007-06-26 2009-01-08 Kaoru Kobayashi Matched filter
US20090119491A1 (en) * 2006-04-05 2009-05-07 Nec Corporation Data processing device
US20100102849A1 (en) * 2008-10-27 2010-04-29 Fuji Xerox Co., Ltd. Electronic device, method for configuring reprogrammable logic element, computer-readable medium, computer data signal and image forming apparatus
US20100253690A1 (en) * 2009-04-02 2010-10-07 Sony Computer Intertainment America Inc. Dynamic context switching between architecturally distinct graphics processors
US7950003B1 (en) * 2006-12-07 2011-05-24 Sony Computer Entertainment Inc. Heads-up-display software development tool for analyzing and optimizing computer software
US8310482B1 (en) * 2008-12-01 2012-11-13 Nvidia Corporation Distributed calculation of plane equations

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301344A (en) * 1991-01-29 1994-04-05 Analogic Corporation Multibus sequential processor to perform in parallel a plurality of reconfigurable logic operations on a plurality of data sets
JP3297925B2 (en) * 1991-09-12 2002-07-02 ソニー株式会社 Signal processing processor
US5808690A (en) * 1996-01-02 1998-09-15 Integrated Device Technology, Inc. Image generation system, methods and computer program products using distributed processing
JP4298006B2 (en) * 1997-04-30 2009-07-15 キヤノン株式会社 Image processor and image processing method thereof
US7577822B2 (en) * 2001-12-14 2009-08-18 Pact Xpp Technologies Ag Parallel task operation in processor and reconfigurable coprocessor configured based on information in link list including termination information for synchronization
US7564996B2 (en) * 2003-01-17 2009-07-21 Parimics, Inc. Method and apparatus for image processing
US7015915B1 (en) * 2003-08-12 2006-03-21 Nvidia Corporation Programming multiple chips from a command buffer
WO2006026086A2 (en) * 2004-08-31 2006-03-09 Silicon Optix Method and apparatus for management of bit plane resources


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140022267A1 (en) * 2012-07-19 2014-01-23 Samsung Electronics Co., Ltd. Method and system for accelerating collision resolution on a reconfigurable processor
US9098917B2 (en) * 2012-07-19 2015-08-04 Samsung Electronics Co., Ltd. Method and system for accelerating collision resolution on a reconfigurable processor
CN103020972A (en) * 2012-12-28 2013-04-03 中国科学院长春光学精密机械与物理研究所 Embedded processor based binary image connected domain detecting method
DE102015204899A1 (en) 2014-03-19 2015-09-24 Denso Corporation Data processing device
US9747232B2 (en) 2014-03-19 2017-08-29 Denso Corporation Data processing device
US20160171922A1 (en) * 2014-12-15 2016-06-16 Wai Hung Lee Controller for persistent display panel
US9679523B2 (en) * 2014-12-15 2017-06-13 Nxp Usa, Inc. Controller for persistent display panel with SIMD module that transposes waveform data
CN110189244A (en) * 2019-06-06 2019-08-30 卡瓦科尔牙科医疗器械(苏州)有限公司 Acceleration image processing system for CT images equipment
CN111064906A (en) * 2019-11-27 2020-04-24 北京计算机技术及应用研究所 Domestic processor and domestic FPGA multi-path 4K high-definition video comprehensive display method
CN112001836A (en) * 2020-07-03 2020-11-27 北京博雅慧视智能技术研究院有限公司 Image processing device

Also Published As

Publication number Publication date
JP2011048579A (en) 2011-03-10
EP2472468A1 (en) 2012-07-04
CN102483842A (en) 2012-05-30
WO2011024654A1 (en) 2011-03-03
EP2472468A4 (en) 2014-09-03

Similar Documents

Publication Publication Date Title
US20120147016A1 (en) Image processing device and image processing method
US11859973B2 (en) Large scale CNN regression based localization via two-dimensional map
US11048970B2 (en) Look-up convolutional layer in convolutional neural network
US11640297B2 (en) Instruction and logic for systolic dot product with accumulate
US11010302B2 (en) General purpose input/output data capture and neural cache system for autonomous machines
WO2017107168A1 (en) Event-driven framework for gpu programming
KR20160134713A (en) Hardware-based atomic operations for supporting inter-task communication
US10565670B2 (en) Graphics processor register renaming mechanism
WO2017058448A1 (en) Dense optical flow acceleration
US20170010894A1 (en) Dynamic thread splitting
WO2018045551A1 (en) Training and deploying pose regressions in neural networks in autonomous machines
US9183611B2 (en) Apparatus implementing instructions that impose pipeline interdependencies
EP3948791A1 (en) General purpose register and wave slot allocation in graphics processing
WO2018112782A1 (en) Camera re-localization by enhanced neural regression using middle layer features in autonomous machines
US9148544B2 (en) System, process, and computer program product for implementing a document scanner in a hand-held device
WO2018044437A1 (en) Mechanism to increase thread parallelism in a graphics processor
Komuro et al. A reconfigurable embedded system for 1000 f/s real-time vision
JP2014160516A (en) Image processor and image processing method
TW201810026A (en) Extension of register files for local processing of data in computing environments
US20140176440A1 (en) Apparatus and system for implementing a wireless mouse using a hand-held device
US20200344378A1 (en) Buffer management for plug-in architectures in computation graph structures
WO2024001699A1 (en) Method for processing shader input data, and graphics processing apparatus
Folkers et al. High performance realtime vision for mobile robots on the GPU.
Rahman et al. Parallel image processing system-a modular architecture using dedicated image processing modules and a graphics processor
Brown et al. GPU-accelerated 3-D model-based tracking

Legal Events

Date Code Title Description
AS Assignment

Owner name: THE UNIVERSITY OF TOKYO, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ISHIKAWA, MASATOSHI;KOMURO, TAKASHI;TABATA, TOMOHIRA;SIGNING DATES FROM 20120217 TO 20120220;REEL/FRAME:027784/0799

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION