WO2002001488A1

WO2002001488A1 - Use of image detail to select and classify variably-sized pixel blocks for motion estimation

Info

Publication number: WO2002001488A1
Application number: PCT/US2001/041183
Authority: WO
Inventors: Alan S. Rojer
Original assignee: Rojer Alan S
Priority date: 2000-06-26
Filing date: 2001-06-26
Publication date: 2002-01-03
Also published as: AU2001279269A1

Abstract

In a digital video motion estimation subsystem (1), pixel block-matching is preceded by an analysis of block detail (40) to dynamically select variably-sized pixel blocks for matching. A metric for quantification of block detail is provided. Given an externally specified level of required detail (22), blocks are recursively subdivided as long as at least one subdivision child retains sufficient detail (60). The variably-sized blocks are classified into 'sharp' (91) or high-detail blocks, and 'flat' (92) or low-detail blocks. Flat blocks are prone to spurious matches, while sharp blocks are more likely to match unambiguously or to fail to match due to occlusion or changes of scene content. Block-matching (search) resources may then be concentrated on sharp blocks.

Description

USE OF IMAGE DETAIL TO SELECT AND CLASSIFY VARIABLY-SIZED PIXEL BLOCKS FOR MOTION ESTIMATION

Field of the Invention

The present invention relates to computer-implemented processes and apparatus

for efficient matching of blocks of pixels between distinct frames in a video

sequence.

Background of the Invention and Description of the Prior Art

Pixel block-matching is a crucial component of many processes and apparatus in

video motion estimation. The primary application of video motion estimation is

video compression for efficient transmission over low-bandwidth channels. Other

applications include video frame-rate conversion, temporal interpolation, and noise

removal.

Motion estimation is a prerequisite to exploitation of temporal redundancy in a

digital video signal. Successive frames in a video sequence are likely to possess

substantially similar visual information content. The most widely used motion

estimation techniques in the prior art utilize matching between identically sized

blocks of pixels in temporally adjacent frames. Most matching techniques have in

common a method for evaluation of a possible match, typically using a sum of absolute pixel differences or a sum of squared pixel differences.
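The two match-evaluation measures named above can be sketched as follows. This is a minimal illustration, not taken from the patent; the function names `sad` and `ssd` and the sample blocks are assumptions chosen for the example.

```python
from typing import Sequence

Block = Sequence[Sequence[int]]  # rows of pixel intensities


def sad(a: Block, b: Block) -> int:
    """Sum of absolute pixel differences between two equally sized blocks."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def ssd(a: Block, b: Block) -> int:
    """Sum of squared pixel differences between two equally sized blocks."""
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


# Hypothetical 2 x 2 blocks used only for illustration.
source_block = [[10, 20], [30, 40]]
target_block = [[12, 20], [27, 40]]

sad_value = sad(source_block, target_block)  # |10-12| + |30-27| = 5
ssd_value = ssd(source_block, target_block)  # 2**2 + 3**2 = 13
```

A lower value indicates a better candidate match under either measure; a value of zero means the two blocks are identical.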

The most expensive technique for matching is brute force search of all possible

candidates, usually with some absolute limit on window size. Many block-matching

techniques utilize strategies to limit the number of possible matches to be

considered. Typical strategies include successive refinement of searches from a wide, coarse grid of candidates to a narrow, fine grid (e.g., US Pat. No. 5,706,059, incorporated herein by reference). Another popular strategy is to use lower-resolution versions of the source image for coarse matching, with the results of low-cost coarse matching serving as initialization of high-cost, high-resolution fine matching (e.g., US Pat. No. 5,801,014, incorporated herein by reference).

All block matching techniques suffer from several intrinsic shortcomings. Most

significantly, there may be many "target" image blocks (at a variety of locations in the target image) which will match a "source" image block equally well. This

may be easily seen for a block which is of constant intensity throughout; such a

block will match with perfect fidelity anywhere throughout a region of said

intensity. Less obvious but in many ways more pernicious is that a block with only

a strictly linear feature (e.g., a straight edge or a constant intensity gradient) will

match ambiguously anywhere along the edge, allowing identification of only one of

two motion components. In the closely related context of optic flow estimation, it is well known in the prior art that only in regions of sufficient Gaussian curvature (i.e., image detail) can unambiguous motion estimation take place. This problem has been known for many years as the aperture problem (cf. S. S. Beauchemin and J. L. Barron, "The Computation of Optic Flow", ACM Computing Surveys, Vol. 27, No. 8, Sept. 1995, pp. 433-467, incorporated herein by reference). Of course, as the

block size is increased, the likelihood that a block will contain sufficient detail for

an unambiguous match also increases.
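The ambiguity described above for a block containing only a straight edge can be reproduced with a small experiment. The sketch below is illustrative only; the synthetic image, the block position, and the helper names are assumptions, and a sum of absolute differences is used as the match measure.

```python
def make_edge_image(height, width, edge_col):
    """Synthetic image: intensity 0 left of edge_col, 255 from edge_col on."""
    return [[0 if x < edge_col else 255 for x in range(width)]
            for _ in range(height)]


def block(img, top, left, size):
    """Extract a size x size block whose upper-left pixel is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]


def sad(a, b):
    """Sum of absolute pixel differences."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


img = make_edge_image(16, 16, edge_col=8)
src = block(img, 6, 6, 4)  # 4 x 4 block straddling the vertical edge

shifts = (-2, -1, 0, 1, 2)
# Sliding along the edge (vertically): every candidate matches perfectly.
along_edge = [sad(src, block(img, 6 + d, 6, 4)) for d in shifts]
# Sliding across the edge (horizontally): only the true position matches.
across_edge = [sad(src, block(img, 6, 6 + d, 4)) for d in shifts]
```

Here `along_edge` is zero for every shift while `across_edge` is zero only at shift 0, so only the motion component perpendicular to the edge can be recovered from this block.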

Another shortcoming of block search methods is their poor handling of natural

motion borders in real scenes. Rarely do such borders fall on an orderly block grid.

In a typical scene, many blocks will straddle regions of disparate motion. In these

cases, the best match may be a motion vector corresponding to either region, or

a compromise between the vectors. In this situation, the use of smaller blocks will

usually reduce the degree of overlap, permitting better resolution of the motion

edge.

In effect, conventional block matching techniques must select a block size for

matching which compromises between these incompatible constraints: larger blocks

for less matching ambiguity, smaller blocks for higher precision in detection and

representation of motion borders in the scene.

Several inventors have addressed aspects of this unpleasant tradeoff in the prior

art.

The importance of detail in block matching is acknowledged in several inventions. Astle (US Pat. No. 6,020,926), incorporated herein by reference, teaches a

technique for restricting the error computation in a potential block match mainly to

pixels in regions of high luminance gradient, for the purpose of reducing the

expense of a block comparison. There is no concept of varying the block size to

accommodate the uneven distribution of detail around a scene, however. Reitmeier

(US Pat. No. 5,987,180), incorporated herein by reference, teaches the use of chrominance to make up for missing luma detail. Kundu (US Pat. No. 5,974,192),

incorporated herein by reference, teaches a categorization of pixels according to

qualitative characteristics which recognizes the ambiguity of matching where

textural information is lacking. Closely related is Jung (US Pat. No. 5,808,685),

incorporated herein by reference, which weights error signals using local pixel

variance or gradient estimation.

The use of variably-sized search blocks may also be found in the prior art. Zhang

et al. (US Pat. No. 5,477,272), incorporated herein by reference, teaches the use

of a multi-resolution motion estimation with different block sizes at different

resolution levels, with the usual successive refinement strategy from coarse to fine.

Krause (US Pat. No. 5,235,419), incorporated herein by reference, teaches the use of a plurality of block sizes, evaluated in parallel, with selection of a motion vector from the best match thereby. Similarly, Knauer et al. (US Pat. No. 5,144,423), incorporated herein by reference, also teaches the use of two different block sizes, large and small, but the thrust of that invention is management of bit budget.

In none of these inventions is the signal content of the block used to guide the selection of block size.

Jung (US Pat. No. 5,561,475), incorporated herein by reference, teaches the use

of a variably-sized search block where the block size variation is based on

incremental growth of fixed sized blocks to contain an edge. In this invention, the

selected block size is influenced by the content of the block, with the block size

increased at one-pixel increments until the variance of the pixels in the block

exceeds a predetermined threshold. The use of variance as a measure of detail is

unsatisfactory due to the large contribution of first-order (gradient and linear edge)

features that do not permit unambiguous matching.

Restriction of matching to a subset of blocks may also be found in the prior art. Liu

and Zaccarin (US Pat. No. 5,398,068, and US Pat. No. 5,210,605), both

incorporated herein by reference, teach a method whereby only a subset of blocks

are utilized for search purposes, but the subset of blocks is chosen using an

arbitrary pattern, with no consideration of the signal content of the blocks.

Thus, the prior art, while recognizing the importance of detail in block matching,

and also identifying a variety of uses for variably sized block matching, has not yet

learned to take advantage of variable block sizes to manage the compromise

between the desire for large block sizes to ensure sufficient detail to match without

ambiguity and the desire to use the smallest usable block size for the highest precision in motion estimation, especially at motion borders.

Summary of the Invention

It is an object of the present invention to effectively mediate the competing

constraints of large block size for unambiguous matching versus small block size

for increased estimation precision.

It is another object of the present invention to conserve computational resources

by avoiding block searches that are unlikely to provide unambiguous matches.

Further objects and/or advantages of the invention will become apparent in

conjunction with the disclosure herein.

The input to the preferred embodiment of the present invention is a source image

21, which is to provide a source of pixel blocks for matching against subsequent

or preceding frames of video.

A bottom-up computation of detail 40 in the source image is used to populate an

image pyramid 50 (P. J. Burt, The Pyramid as a Structure for Efficient Computation,

in A. Rosenfeld, Editor, Multiresolution Image Processing and Analysis,

Springer-Verlag, Berlin, 1984, pp. 6-35, incorporated herein by reference), which will be familiar to those skilled in the prior art. In the preferred embodiment, blocks of pixels are required to align with the pyramid grid, so, except at the borders of the image, blocks are sized with power-of-two dimensions. For each cell in the pyramid,

the measure of detail is the energy in all the high-high terms in the Haar transform

for the pixels underlying the pyramid cell (E. J. Stollnitz, T. D. DeRose, and D. H.

Salesin, Wavelets for Computer Graphics, Morgan Kaufmann, San Francisco, 1996,

incorporated herein by reference).

The preferred embodiment assigns to each block of pixels in the partition a measure

of "detail", which is closely correlated to the likelihood of unambiguous matching of the blocks. In the preferred embodiment, an externally provided threshold 22 for comparison with the computed "detail" is utilized to subdivide output blocks into "sharp" and "flat" blocks, where sharp blocks are considered to have sufficient

detail for unambiguous matching, while flat blocks are considered to lack sufficient

detail for unambiguous matching. Further processing stages may elect to ignore flat

blocks, or devote substantially reduced effort to match evaluation of flat blocks,

thus saving computational resources.

The externally-provided detail threshold 22 is next utilized to build a quad-tree 90

in registration with the image pyramid 50. This computation 60 proceeds by

top-down recursive subdivision of blocks in the quad-tree, starting from the root,

corresponding to the whole image. As the subdivision proceeds, terminal blocks are

accumulated into collections of "sharp" blocks 91, whose block detail exceeds the detail threshold 22, and "flat" blocks 92, for which block detail does not

exceed the threshold 22. When all possible subdivisions have been performed, the undivided blocks form a variably-sized tiling of the original image.

Brief Description of the Drawings

A full understanding of the invention can be gained from the following description

of the preferred embodiments when read in conjunction with the accompanying

drawings in which:

FIG. 1 is a block-diagram of the computations of the preferred embodiment for the

present invention;

FIG. 2 provides the preferred embodiment of the image pyramid data structure

which is used internally for bottom-up computation of block detail;

FIG. 3 displays the preferred embodiment of the pyramid datum;

FIG. 4 displays the kernels used in construction of the image pyramid for the detail

computation;

FIG. 5 describes the preferred embodiment of the algorithm for detail computation

using the image pyramid;

FIG. 6 presents the quad-tree geometry utilized for block subdivision;

FIG. 7 displays the quad-tree datum;

FIG. 8 displays the quad-tree data structure in the preferred embodiment;

FIG. 9 provides the preferred embodiment of the algorithm for block subdivision

using the quad-tree; and

FIG. 10 is an example of the variable-sized block selection and leaf partition applied

to a real image.

Detailed Description of the Preferred Embodiments and the Drawings

The preferred embodiment of the algorithm proceeds in two main steps, with an

intermediate data structure, as shown in Fig. 1. The source image 21 is processed in the detail computation 40 in a bottom-up computation to produce an image

pyramid 50. The intermediate image pyramid 50 is then processed top-down in the

quad-tree subdivision 60. The subdivision is controlled by the externally-supplied

detail threshold 22. The products of the subdivision are the block quad-tree 90, the

leaves of which are non-overlapping variably-sized blocks of pixels, and a

classification of those blocks into "sharp" blocks 91 whose detail is in excess of the detail threshold 22, and "flat" blocks 92 whose detail does not exceed the detail threshold 22.

The source image 21 is an intensity image, which is typically luma, but there is no

restriction in application of the invention to chroma or conventional red, blue, or

green channels. However, both video bandwidth and human perceptual sensitivity

are highest for luma, so luma is preferred as an input.

The detail threshold 22 is a scalar. The units of detail are pixel signal energy, and

as such they may be related to the square of the maximum intensity of the pixels

in the source image 21. In a typical case, with pixel values ranging over 0-255,

detail thresholds in the range 1000 - 3000 have been found to give satisfactory

block selections for use in downstream block matching.
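To put the quoted threshold range in proportion, the short calculation below relates it to the square of the maximum 8-bit intensity. It is plain arithmetic on the figures stated above; the variable names are illustrative, and the exact scaling of the detail measure depends on the kernel normalization in Fig. 4, which is not reproduced here.

```python
import math

MAX_INTENSITY = 255
max_energy = MAX_INTENSITY ** 2          # square of the maximum intensity
typical_thresholds = (1000, 3000)        # range quoted in the text

# Each threshold as a fraction of the squared maximum intensity.
fractions = [t / max_energy for t in typical_thresholds]

# Amplitude whose square equals each threshold, in intensity units.
amplitudes = [math.sqrt(t) for t in typical_thresholds]

for t, f, a in zip(typical_thresholds, fractions, amplitudes):
    print(f"threshold {t}: {f:.1%} of {max_energy}, amplitude about {a:.1f}")
```

The typical thresholds therefore sit at a few percent of the squared maximum intensity.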

The internal pyramid structure 50 will be examined in detail in advance of the detail

computation 40, since the detail computation 40 populates the pyramid 50.

An image pyramid will be familiar to those skilled in the prior art. In the simplest

usage, the image pyramid presents a collection of reduced resolution versions of a

source image, with each reduced resolution image derived from the image at the

next higher level of resolution. The pyramid construction proceeds bottom-up, with

the deepest (highest resolution) level of the pyramid in registration with a source

image.

In Fig. 2, an example of the preferred embodiment of the image pyramid 50 is

presented. The image pyramid 50 contains a scalar depth 501, which specifies the number of layers 502, each of which is an image, 510, 511, 512, 513, and 514. In Fig. 2, we have fixed a depth of 5 for purposes of illustration, but of course the depth may take arbitrary positive integral values. Each image 510, etc., contains a two dimensional array, with each individual element in the array a pyramid datum 5101, 5111, 5121, 5131, and 5141. The pyramid datum 5101, etc., will be

examined in further detail in Fig. 3.

The structure of the image pyramid is tightly coupled to the structure of the source

image 21. The deepest layer of the pyramid 514 is in correspondence with the

source image 21 . For illustrative purposes in Fig. 2, the source image has been

assumed to comprise an array of 32 x 24 pixels, but, as will be described herein,

there is no restriction placed on the dimensions of the source image 21.

The pyramid 50 is constructed from the source image using a bottom-up process of

coalescence with optional augmentation. In the coalescence process, each

non-overlapping 2 x 2 datum window is associated with a single datum in the

succeeding layer. The coalesced elements are denoted children while the single

datum in the succeeding layer is denoted the parent. Children may be referenced

by offset from the parent using a compass notation {NE, NW, SW, SE}

corresponding to the quadrant of the parent occupied by each child. Any element

in the pyramid may be accessed at random in constant time by the use of three

indices h, i, and j, which specify the depth in the pyramid, and the row and column

in the layer, respectively. The three indices collectively will be denoted a "pyramid index" 9113 (Fig. 7).
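One way the constant-time parent and child addressing could work is sketched below. The arithmetic (a child sits one level deeper at twice the parent's row and column plus a 0/1 offset) follows from the 2 x 2 coalescence, but the mapping of compass names to offsets assumes row 0 is the top (north) row and column 0 is the left (west) column; the names `PyramidIndex`, `child_index`, and `parent_index` are hypothetical.

```python
from typing import NamedTuple


class PyramidIndex(NamedTuple):
    h: int  # depth in the pyramid (0 is the root)
    i: int  # row within the layer
    j: int  # column within the layer


# Assumed (row, column) offsets of each child within its 2 x 2 window.
CHILD_OFFSETS = {"NE": (0, 1), "NW": (0, 0), "SW": (1, 0), "SE": (1, 1)}


def child_index(parent: PyramidIndex, orientation: str) -> PyramidIndex:
    """Index of the child occupying the given quadrant of the parent."""
    di, dj = CHILD_OFFSETS[orientation]
    return PyramidIndex(parent.h + 1, 2 * parent.i + di, 2 * parent.j + dj)


def parent_index(child: PyramidIndex) -> PyramidIndex:
    """Index of the datum one layer up that coalesces this child."""
    return PyramidIndex(child.h - 1, child.i // 2, child.j // 2)


example_parent = PyramidIndex(h=1, i=2, j=3)
example_child = child_index(example_parent, "SE")  # PyramidIndex(2, 5, 7)
```

Because both directions are simple integer arithmetic, no explicit links between pyramid layers need to be stored.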

Since coalescence requires a 2 x 2 datum window, an odd-sized image is

augmented by replication of first or last row or column, as necessary. This process

is illustrated in the construction of 511, the pyramid layer at depth 1 in Fig. 2, from 512, the pyramid layer at depth 2. In 512, we have 3 x 2 datum elements from the coalescence of layer 3, 513. The 3 x 2 datum elements are augmented by

duplication of the top row of elements.
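A minimal sketch of augmentation by replication follows. The text allows either the first or the last row or column to be replicated; this sketch arbitrarily replicates the last one, and the function name `augment` is an assumption made for illustration.

```python
def augment(layer):
    """Pad a layer to even dimensions by replicating its last column and row.

    `layer` is a list of equally long rows. A new list is returned; the
    input is left unchanged.
    """
    rows = [list(row) for row in layer]
    if len(rows[0]) % 2:              # odd number of columns
        rows = [row + [row[-1]] for row in rows]
    if len(rows) % 2:                 # odd number of rows
        rows.append(list(rows[-1]))
    return rows


odd_layer = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
padded = augment(odd_layer)
# padded is 4 x 4:
# [[1, 2, 3, 3], [4, 5, 6, 6], [7, 8, 9, 9], [7, 8, 9, 9]]
```

After augmentation every datum belongs to exactly one complete 2 x 2 window, so coalescence can proceed without special cases.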

For maximum efficiency in computation, the preferred embodiment of the pyramid

50 utilizes the source image 21 as its deepest layer, unless the source image 21

is odd-sized and requires augmentation. Each element of the deepest layer, whether it be the source image 21 or an augmented copy thereof, is defined to have zero detail and zero recursive detail for the purposes of the detail

computation. Note also that elements in the deepest layer of the pyramid have no

children.

In addition to image data providing a series of reduced-resolution versions of the

source image, it is often convenient to associate other data with each element of

the pyramid. In Fig. 3, a pyramid datum 5101 in the preferred embodiment of the present invention incorporates detail 51012 and recursive detail 51013 in addition to the usual signal level 51011 (image intensity).

To populate the pyramid 50 in the preferred embodiment of the present invention,

two kernels are utilized. The signal kernel 521 and the detail kernel 522 are

depicted in Fig. 4. Those skilled in the prior art will recognize that these kernels

correspond to the Lo-Lo and Hi-Hi kernels of the Haar transform, the oldest and

simplest of wavelet transforms. The signal kernel 521 is the simplest low-pass

filter; it is applied here to the construction of reduced resolution copies of the

source image. The detail kernel 522 is utilized to identify image detail which is likely

to match unambiguously. It represents the remaining signal after constant and

first-order (gradient) signal has been removed by the Lo-Lo, Hi-Lo, and Lo-Hi kernels

in the decomposition. Since the constant and first-order signals are prone to

ambiguous matching, the restriction of the detail measure to Hi-Hi is a major

contribution to this invention.
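The behavior of the two kernels can be checked on a few 2 x 2 windows. Fig. 4 is not reproduced in this text, so the coefficients below are assumptions: the signal kernel is taken as a plain average (weights of 1/4) and the detail kernel as the Haar Hi-Hi sign pattern with weights of 1/2. The sample windows and names are illustrative.

```python
# Assumed kernel coefficients; the actual values are those of Fig. 4.
SIGNAL_KERNEL = ((0.25, 0.25), (0.25, 0.25))   # Lo-Lo: local average
DETAIL_KERNEL = ((0.5, -0.5), (-0.5, 0.5))     # Hi-Hi: diagonal difference


def inner_product(window, kernel):
    """Inner product of a 2 x 2 pixel window with a 2 x 2 kernel."""
    return sum(window[r][c] * kernel[r][c] for r in range(2) for c in range(2))


windows = {
    "constant": [[5, 5], [5, 5]],
    "vertical edge": [[0, 255], [0, 255]],
    "linear gradient": [[0, 10], [10, 20]],
    "corner": [[0, 0], [0, 255]],
    "checker": [[255, 0], [0, 255]],
}

hi_hi = {name: inner_product(w, DETAIL_KERNEL) for name, w in windows.items()}
lo_lo = {name: inner_product(w, SIGNAL_KERNEL) for name, w in windows.items()}
```

Under these assumed coefficients the Hi-Hi response is zero for the constant, edge, and gradient windows and non-zero only for the corner and checker windows, which is the property the detail measure relies on.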

The bottom-up computation of detail 40 using the image pyramid 50 is depicted in

Fig. 5. This computation proceeds from the lowest level of the pyramid to the pyramid's root. At 401, the local variable level is initialized to the pyramid depth. At 402, the main loop is controlled by the non-zero property of the level. Inside the loop at 4021, the level is decremented. This ensures that the level in the loop will always lie between 0 and depth-1 inclusive. At 4022, each datum in the current pyramid level is considered. The signal level (e.g., 51011) and detail (e.g., 51012) for the current datum are computed in 40221 and 40223 by inner product of the signal level of the children with the signal 521 and detail 522 kernels, respectively. The recursive detail (e.g., 51013) is initialized with the detail 51012, then the recursive detail of each child of the datum (if any) is added to the current datum's recursive detail in 402241.
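The bottom-up pass might be sketched as follows. Several choices here are assumptions rather than the patent's stated implementation: the kernel weights (average for signal, half-weighted Hi-Hi for detail), storing detail as the squared Hi-Hi coefficient so that it is an energy (in line with claim 2), replicating the last row or column for odd sizes, and returning the layers root-first. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Datum:
    signal: float
    detail: float = 0.0            # zero by definition in the deepest layer
    recursive_detail: float = 0.0


def augment(layer):
    """Pad to even dimensions by replicating the last column and row."""
    rows = [list(row) for row in layer]
    if len(rows[0]) % 2:
        rows = [row + [row[-1]] for row in rows]
    if len(rows) % 2:
        rows.append(list(rows[-1]))
    return rows


def build_pyramid(image):
    """Return pyramid layers, root first; the last layer mirrors the image."""
    layers = [[[Datum(float(p)) for p in row] for row in image]]
    while len(layers[0]) > 1 or len(layers[0][0]) > 1:
        layers[0] = augment(layers[0])
        children = layers[0]
        parents = []
        for i in range(0, len(children), 2):
            row = []
            for j in range(0, len(children[0]), 2):
                nw, ne = children[i][j], children[i][j + 1]
                sw, se = children[i + 1][j], children[i + 1][j + 1]
                signal = (nw.signal + ne.signal + sw.signal + se.signal) / 4
                hi_hi = (nw.signal - ne.signal - sw.signal + se.signal) / 2
                detail = hi_hi * hi_hi
                # Recursive detail: own detail plus that of all four children.
                recursive = detail + sum(
                    c.recursive_detail for c in (nw, ne, sw, se))
                row.append(Datum(signal, detail, recursive))
            parents.append(row)
        layers.insert(0, parents)
    return layers


# Hypothetical 4 x 4 image: a checker pattern in one corner, flat elsewhere.
image = [[255, 0, 0, 0],
         [0, 255, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
layers = build_pyramid(image)
root = layers[0][0][0]
```

For this image only the upper-left quadrant carries detail, and the root's recursive detail accumulates both its own Hi-Hi energy and that quadrant's. Note that replicated border data are counted like any other data in this sketch.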

At the conclusion of the detail computation 40, the pyramid will contain a measure

of signal, detail and recursive detail for each datum from the base (5141, etc.) to the root 5101. Each pyramid datum corresponds to a window of pixels (a candidate

block) as well as a geometric region in the image. The detail computation proceeded

from the bottom-up, working from the pixels in the source image 21 up to the root

of the pyramid 50, layer by layer. The algorithm now proceeds from top down,

beginning from the root 5101.

The crucial supporting process in the selection of blocks is subdivision of a block

or window into four equally sized, non-overlapping children, occupying the same

area as the original block. The subdivision of a block is illustrated in Fig. 6. The

parent block 9100, corresponding to a pyramid datum at level h, row i, column j, with geometric bounding box (u, v, u + delta, v + delta), is subdivided into four children, 9101, 9102, 9103, with pyramid indices and geometry as shown.
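The geometry of the subdivision can be sketched as below. Fig. 6 is not reproduced in this text, so the axis convention is an assumption: u is taken as the horizontal coordinate increasing eastward and v as the vertical coordinate increasing southward. The function name `child_boxes` is hypothetical, and delta is assumed to be even.

```python
def child_boxes(u, v, delta):
    """Bounding boxes (u0, v0, u1, v1) of the four children of a square block.

    The parent occupies (u, v, u + delta, v + delta); each child covers one
    quadrant with side delta // 2.
    """
    half = delta // 2
    return {
        "NW": (u, v, u + half, v + half),
        "NE": (u + half, v, u + delta, v + half),
        "SW": (u, v + half, u + half, v + delta),
        "SE": (u + half, v + half, u + delta, v + delta),
    }


boxes = child_boxes(0, 0, 8)
# Total area of the children equals the area of the parent (8 * 8 = 64).
total_area = sum((u1 - u0) * (v1 - v0) for u0, v0, u1, v1 in boxes.values())
```

The children are non-overlapping and together occupy exactly the parent's area, as the text requires.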

Initially only the root is available for subdivision. When a parent is subdivided, its

children become candidates for subdivision. The quad-tree is a convenient data

structure for the representation and management of this process. The quad-tree will

be familiar to those skilled in the prior art; it provides at a minimum a link from a

parent quad to its children and typically a link back from child to parent.

In the preferred embodiment of the present invention, the quad-tree datum is also

provided with a pyramid index to refer to detail and provide geometry information.

The quad-tree datum 911 is illustrated in Fig. 7. The datum 911 provides a link 9111 to its parent, which is 0 in case the datum 911 is the root of the quad-tree. Also, the orientation 9112 of the child amongst the parent's children is retained in the datum. The orientation 9112 takes one of the values NE, NW, SW, SE, except in the case of the root, where the orientation is undefined. The quad-tree datum 911 contains a pyramid index 9113 which identifies the associated pyramid datum, and hence provides a source for detail information as well as geometric information. The pyramid index 9113 in turn contains individual indices for depth (h) 91131, row (i) 91132, and column (j) 91133. Finally, the quad-tree datum contains links to its children 9114, if any. There are four children, 91141, 91142, 91143, and 91144, corresponding to the orientations NE, NW, SW, SE. If the quad-tree datum is a leaf, the child links will be 0. Otherwise, each child link refers to a distinct quad-tree datum. As an illustration of the quad-tree linkages, a quad-tree data structure after subdivision of an arbitrary quad-tree datum associated with pyramid index h, i, j is shown in Fig. 8. The parent node 9120 has been subdivided to provide children 9121, 9122, 9123, and 9124. The parent and orientation of the parent node 9120

are not shown as they refer to elements outside of the figure.

With the specification of the quad-tree, the detailed subdivision algorithm 60 is presented in Fig. 9. This algorithm will provide the block selections quad-tree 90, and the collections of sharp and flat blocks, 91 and 92, respectively. The algorithm operates on the detail pyramid 50, with an externally supplied detail threshold 22

to control the subdivision process. The algorithm makes use of an internal stack of

quad-tree nodes, which is a last-in, first-out collection which will be familiar to

those skilled in the prior art. The stack provides push and pop operations to insert

and remove elements. An alternative embodiment could make use of a recursive

algorithm to obviate the direct use of the stack, possibly with a slight loss of

efficiency. The algorithm also assumes a constructor for quad tree nodes, indicated

by newquad tree node, which requires as arguments the parent quad-tree node and

the pyramid index which is to be associated with the new node.

Initially, the sharp nodes and flat nodes collections are empty (601, 602). The quad-tree is initialized to a single node, corresponding to the root of the pyramid (603, 604). The detail associated with the root is then consulted in 605. If the root node's detail is in excess of the detail threshold 22, as will typically, but not always, be the case, the root is pushed onto the stack (6051). Otherwise (606), the root is added to the flat nodes collection (6061).

The main loop of the algorithm 60 (607) is based on the presence of sharp node

candidates on the stack. No node is placed on the stack unless the detail associated

with its pyramid datum exceeds the threshold. Hence, a node on the stack is either

sharp, or will be subdivided to yield one or more descendent sharp nodes. Thus,

while there are candidates on the stack (607), the algorithm takes a candidate node

(6071). Initially, the algorithm assumes the node will not be subdivided (6072). The children of the candidate node are examined in a loop at 6073. The detail associated with each child is compared to the detail threshold 22 (60731). If a child is found with detail in excess of the threshold (607311), the subdivision flag is raised (607311) and the scan of the children is aborted (607312).

If the subdivision flag was raised (6074), the children are scanned again (60741), and a new quad-tree node is created for each child (607411). The detail associated with the child is compared to the detail threshold 22 (607412). If detail in excess of the threshold is found, the child is placed on the stack as a subdivision candidate (6074121). Otherwise (607413), the child is added to the collection of flat nodes (6074131).

If the subdivision flag was not raised (6075), the node is added to the collection of

sharp nodes (60751).

The algorithm 60 continues until there are no subdivision candidates remaining on

the stack.
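The stack-based subdivision described above might look like the following sketch. It is an illustration under stated assumptions, not the patent's Fig. 9: the value compared with the threshold is taken to be each datum's recursive detail, the pyramid is supplied as nested lists `detail[h][i][j]` ordered root-first, and the names `QuadNode`, `child_indices`, and `subdivide` are hypothetical. The sample pyramid is hand-made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Index = Tuple[int, int, int]  # (h, i, j): depth, row, column


@dataclass
class QuadNode:
    parent: Optional["QuadNode"]
    index: Index
    children: List["QuadNode"] = field(default_factory=list)


def child_indices(index: Index, depth: int) -> List[Index]:
    """Pyramid indices of the NE, NW, SW, SE children, or [] below the base."""
    h, i, j = index
    if h + 1 >= depth:
        return []
    return [(h + 1, 2 * i + di, 2 * j + dj)
            for di, dj in ((0, 1), (0, 0), (1, 0), (1, 1))]


def subdivide(detail, threshold):
    """Split blocks while at least one child exceeds the threshold."""
    def d(index: Index) -> float:
        h, i, j = index
        return detail[h][i][j]

    sharp, flat, stack = [], [], []
    root = QuadNode(None, (0, 0, 0))
    (stack if d(root.index) > threshold else flat).append(root)
    while stack:
        node = stack.pop()
        kids = child_indices(node.index, len(detail))
        # any() stops at the first child that exceeds the threshold.
        if any(d(k) > threshold for k in kids):
            for k in kids:
                child = QuadNode(node, k)
                node.children.append(child)
                (stack if d(k) > threshold else flat).append(child)
        else:
            sharp.append(node)  # detailed, but no child is detailed enough
    return root, sharp, flat


# Hand-made recursive-detail pyramid: root, a 2 x 2 layer, a 4 x 4 base of zeros.
detail = [
    [[69089.0625]],
    [[65025.0, 0.0], [0.0, 0.0]],
    [[0.0] * 4 for _ in range(4)],
]
root, sharp, flat = subdivide(detail, threshold=2000.0)
```

With this input the root is split once, the upper-left quadrant becomes the single sharp block, and the other three quadrants are flat. The leaves of the resulting tree tile the image with blocks of two different sizes only when detail is unevenly distributed across deeper levels.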

Fig. 10 is a demonstration of the algorithm on the famous "Lena" image, here in 512x512 luma. The detail threshold used here was 10,000, which is larger than usual (1000-3000), but makes for a better illustration. The tiling shown illustrates the selected blocks for matching; sharp nodes are indicated with an x, flat nodes are left empty.

Having described this invention with regard to specific embodiments, it is to be understood that the description is not meant as a limitation since further variations or modifications may be apparent or may suggest themselves to those skilled in the art. It is intended that the present application cover such variations and modifications as fall within the scope of the appended claims.

In addition to the disclosure of the inventions provided herein, several additional

references may be of interest to those of ordinary skill and useful for additional

background and information of relevance. These references include:

1. S. S. Beauchemin and J. L. Barron, "The Computation of Optic Flow", ACM Computing Surveys, Vol. 27, No. 8 (Sept. 1995), pp. 433-467.

2. P. J. Burt, The Pyramid as a Structure for Efficient Computation, in A. Rosenfeld, Editor, Multiresolution Image Processing and Analysis, Springer-Verlag, Berlin, 1984, pp. 6-35.

3. E. J. Stollnitz, T. D. DeRose, and D. H. Salesin, Wavelets for Computer Graphics, Morgan Kaufmann, San Francisco, 1996.

Claims

What is claimed is:

1. A method for selection and classification of variably-sized pixel blocks in a

source image which balances the competing constraints of increasing block size for

unambiguous matching against decreasing block size for accuracy in computation

of motion fields, the method comprising the steps of:

bottom-up computing of a measure of detail for each candidate pixel block from a

source image;

under control of an externally provided detail threshold, top-down splitting of

candidate pixel blocks, said top-down splitting is performed as long as at least one of

the pixel blocks resulting from the split has detail in excess of the detail threshold;

and

classifying the pixel blocks split in the preceding step according to whether said

measure of detail in each of said blocks exceeds said externally provided detail

threshold.

2. The method of claim 1, wherein said bottom-up computing of said measure of

detail utilizes a recursive sum of squared hi-hi Haar coefficients.

3. The method of claim 1, wherein said step of bottom-up computing is performed

using an image pyramid.

4. The method of claim 3, wherein said top-down splitting is performed using a

quad-tree associated with said image pyramid from claim 3, such that leaves of said

quad-tree correspond to said split pixel blocks.

5. The method of claim 1, wherein said classifying of said split pixel blocks

according to said measure of detail is embodied in a sharp blocks collection and a

flat blocks collection, wherein said sharp blocks collection contains blocks with said

measure of detail in excess of said detail threshold, and said flat blocks collection

contains blocks with said measure of detail not in excess of said detail threshold.