USE OF IMAGE DETAIL TO
SELECT AND CLASSIFY VARIABLY-SIZED PIXEL BLOCKS
FOR MOTION ESTIMATION
Field of the Invention
The present invention relates to computer-implemented processes and apparatus
for efficient matching of blocks of pixels between distinct frames in a video
sequence.
Background of the Invention and Description of the Prior Art
Pixel block-matching is a crucial component of many processes and apparatus in
video motion estimation. The primary application of video motion estimation is
video compression for efficient transmission over low-bandwidth channels. Other
applications include video frame-rate conversion, temporal interpolation, and noise
removal.
Motion estimation is a prerequisite to exploitation of temporal redundancy in a
digital video signal. Successive frames in a video sequence are likely to possess
substantially similar visual information content. The most widely used motion
estimation techniques in the prior art utilize matching between identically sized
blocks of pixels in temporally adjacent frames. Most matching techniques have in
common a method for evaluation of a possible match, typically using a sum of
absolute pixel differences or sum of squared pixel differences.
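The two match-evaluation measures named above may be sketched as follows. This is a minimal illustration only; the function names sad and ssd and the sample pixel values are assumptions introduced here and do not appear in the cited prior art.

```python
def sad(block_a, block_b):
    """Sum of absolute pixel differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def ssd(block_a, block_b):
    """Sum of squared pixel differences between two equally sized blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


# Hypothetical 2 x 2 blocks, given as lists of rows of intensities.
source = [[10, 20], [30, 40]]
target = [[12, 18], [30, 45]]
print(sad(source, target))  # 2 + 2 + 0 + 5 = 9
print(ssd(source, target))  # 4 + 4 + 0 + 25 = 33
```

A lower value indicates a better candidate match under either measure; the squared form penalizes large single-pixel errors more heavily.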
The most expensive technique for matching is brute force search of all possible
candidates, usually with some absolute limit on window size. Many block-matching
techniques utilize strategies to limit the number of possible matches to be
considered. A typical strategy is successive refinement of searches from a
wide, coarse grid of candidates to a narrow, fine grid (e.g., U.S. Pat. No.
5,706,059, incorporated herein by reference). Another popular strategy is to use
lower-resolution versions of the source image for coarse matching, with the results
of low-cost coarse matching serving as initialization of high-cost, high-resolution
fine matching (e.g., U.S. Pat. No. 5,801,014, incorporated herein by reference).
All block matching techniques suffer from several intrinsic shortcomings. Most
significantly, there may be many "target" image blocks (at a variety of locations
in the target image) which will match a "source" image block equally well. This
may be easily seen for a block which is of constant intensity throughout; such a
block will match with perfect fidelity anywhere throughout a region of said
intensity. Less obvious but in many ways more pernicious is that a block with only
a strictly linear feature (e.g., a straight edge or a constant intensity gradient) will
match ambiguously anywhere along the edge, allowing identification of only one of
two motion components. In the closely related context of optic flow estimation, it is
well known in the prior art that only in regions of sufficient Gaussian curvature (i.e.,
image detail) can unambiguous motion estimation take place. This problem has been
known for many years as the aperture problem (cf., S. S. Beauchemin and J. L.
Barron, "The Computation of Optic Flow", ACM Computing Surveys, Vol. 27, No.
8, Sept. 1995, pp. 433-467, incorporated herein by reference). Of course, as the
block size is increased, the likelihood that a block will contain sufficient detail for
an unambiguous match also increases.
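The ambiguity described above can be demonstrated with a small sketch. The synthetic image, the block position, and the helper names block and sad are assumptions chosen for illustration: a block straddling a straight vertical edge matches perfectly at every vertical displacement, so only the horizontal motion component can be recovered.

```python
def block(image, row, col, size):
    """Extract a size x size block whose top-left pixel is (row, col)."""
    return [r[col:col + size] for r in image[row:row + size]]


def sad(block_a, block_b):
    """Sum of absolute pixel differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


# Synthetic 6 x 6 image containing a single straight vertical edge.
image = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
source = block(image, 2, 2, 2)  # straddles the edge: [[0, 255], [0, 255]]

# Match error for every vertical and every horizontal placement.
vertical = [sad(source, block(image, r, 2, 2)) for r in range(5)]
horizontal = [sad(source, block(image, 2, c, 2)) for c in range(5)]
print(vertical)    # [0, 0, 0, 0, 0]: every position along the edge matches
print(horizontal)  # [510, 510, 0, 510, 510]: only one position matches
```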
Another shortcoming of block search methods is their poor handling of natural
motion borders in real scenes. Rarely do such borders fall on an orderly block grid.
In a typical scene, many blocks will straddle regions of disparate motion. In these
cases, the best match may be a motion vector corresponding to either region, or
a compromise between the vectors. In this situation, the use of smaller blocks will
usually reduce the degree of overlap, permitting better resolution of the motion
edge.
In effect, conventional block matching techniques must select a block size for
matching which compromises between these incompatible constraints: larger blocks
for less matching ambiguity, smaller blocks for higher precision in detection and
representation of motion borders in the scene.
Several inventors have addressed aspects of this unpleasant tradeoff in the prior
art.
The importance of detail in block matching is acknowledged in several inventions.
Astle (U.S. Pat. No. 6,020,926), incorporated herein by reference, teaches a
technique for restricting the error computation in a potential block match mainly to
pixels in regions of high luminance gradient, for the purpose of reducing the
expense of a block comparison. There is no concept of varying the block size to
accommodate the uneven distribution of detail around a scene, however. Reitmeier
(U.S. Pat. No. 5,987,180), incorporated herein by reference, teaches the use of
chrominance to make up for missing luma detail. Kundu (U.S. Pat. No. 5,974,192),
incorporated herein by reference, teaches a categorization of pixels according to
qualitative characteristics which recognizes the ambiguity of matching where
textural information is lacking. Closely related is Jung (U.S. Pat. No. 5,808,685),
incorporated herein by reference, which weights error signals using local pixel
variance or gradient estimation.
The use of variably-sized search blocks may also be found in the prior art. Zhang
et al. (U.S. Pat. No. 5,477,272), incorporated herein by reference, teaches the use
of a multi-resolution motion estimation with different block sizes at different
resolution levels, with the usual successive refinement strategy from coarse to fine.
Krause (U.S. Pat. No. 5,235,419), incorporated herein by reference, teaches the use
of a plurality of block sizes, evaluated in parallel, with selection of a motion vector
from the best match thereby. Similarly, Knauer et al. (U.S. Pat. No. 5,144,423),
incorporated herein by reference, also teaches the use of two different block
sizes, large and small, but the thrust of that invention is management of bit budget.
In none of these inventions is the signal content of the block used to guide the
selection of block size.
Jung (U.S. Pat. No. 5,561,475), incorporated herein by reference, teaches the use
of a variably-sized search block where the block size variation is based on
incremental growth of fixed sized blocks to contain an edge. In this invention, the
selected block size is influenced by the content of the block, with the block size
increased at one-pixel increments until the variance of the pixels in the block
exceeds a predetermined threshold. The use of variance as a measure of detail is
unsatisfactory due to the large contribution of first-order (gradient and linear edge)
features that do not permit unambiguous matching.
Restriction of matching to a subset of blocks may also be found in the prior art. Liu
and Zaccarin (U.S. Pat. No. 5,398,068 and U.S. Pat. No. 5,210,605), both
incorporated herein by reference, teach a method whereby only a subset of blocks
are utilized for search purposes, but the subset of blocks is chosen using an
arbitrary pattern, with no consideration of the signal content of the blocks.
Thus, the prior art, while recognizing the importance of detail in block matching,
and also identifying a variety of uses for variably sized block matching, has not yet
learned to take advantage of variable block sizes to manage the compromise
between the desire for large block sizes, which ensure sufficient detail to match
without ambiguity, and the desire to use the smallest usable block size for the highest
precision in motion estimation, especially at motion borders.
Summary of the Invention
It is an object of the present invention to effectively mediate the competing
constraints of large block size for unambiguous matching versus small block size
for increased estimation precision.
It is another object of the present invention to conserve computational resources
by avoiding block searches that are unlikely to provide unambiguous matches.
Further objects and/or advantages of the invention will become apparent in
conjunction with the disclosure herein.
The input to the preferred embodiment of the present invention is a source image
21, which is to provide a source of pixel blocks for matching against subsequent
or preceding frames of video.
A bottom-up computation of detail 40 in the source image is used to populate an
image pyramid 50 (P. J. Burt, The Pyramid as a Structure for Efficient Computation,
in A. Rosenfeld, Editor, Multiresolution Image Processing and Analysis,
Springer-Verlag, Berlin, 1984, pp. 6-35, incorporated herein by reference), which
will be familiar to those skilled in the prior art. In the preferred embodiment, blocks
of pixels are required to align with the pyramid grid, so, except at the borders of the
image, blocks are sized with power of two dimensions. For each cell in the pyramid,
the measure of detail is the energy in all the high-high terms in the Haar transform
for the pixels underlying the pyramid cell (E. J. Stollnitz, T. D. DeRose, and D. H.
Salesin, Wavelets for Computer Graphics, Morgan Kaufmann, San Francisco, 1996,
incorporated herein by reference).
The preferred embodiment assigns to each block of pixels in the partition a measure
of "detail", which is closely correlated to the likelihood of unambiguous matching
of the blocks. In the preferred embodiment, an externally provided threshold 22 for
comparison with the computed "detail" is utilized to subdivide output blocks into
"sharp" and "flat" blocks, where sharp blocks are considered to have sufficient
detail for unambiguous matching, while flat blocks are considered to lack sufficient
detail for unambiguous matching. Further processing stages may elect to ignore flat
blocks, or devote substantially reduced effort to match evaluation of flat blocks,
thus saving computational resources.
The externally-provided detail threshold 22 is next utilized to build a quad-tree 90
in registration with the image pyramid 50. This computation 60 proceeds by
top-down recursive subdivision of blocks in the quad-tree, starting from the root,
corresponding to the whole image. As the subdivision proceeds, terminal blocks are
accumulated into collections of "sharp" blocks 91, whose block detail exceeds
the detail threshold 22, and "flat" blocks 92, for which block detail does not
exceed the threshold 22. When all possible subdivisions have been performed, the
undivided blocks form a variably-sized tiling of the original image.
Brief Description of the Drawings
A full understanding of the invention can be gained from the following description
of the preferred embodiments when read in conjunction with the accompanying
drawings in which:
FIG. 1 is a block-diagram of the computations of the preferred embodiment for the
present invention;
FIG. 2 provides the preferred embodiment of the image pyramid data structure
which is used internally for bottom-up computation of block detail;
FIG. 3 displays the preferred embodiment of the pyramid datum;
FIG. 4 displays the kernels used in construction of the image pyramid for the detail
computation;
FIG. 5 describes the preferred embodiment of the algorithm for detail computation
using the image pyramid;
FIG. 6 presents the quad-tree geometry utilized for block subdivision;
FIG. 7 displays the quad-tree datum;
FIG. 8 displays the quad-tree data structure in the preferred embodiment;
FIG. 9 provides the preferred embodiment of the algorithm for block subdivision
using the quad-tree; and
FIG. 10 is an example of the variable-sized block selection and leaf partition applied
to a real image.
Detailed Description of the Preferred Embodiments and the Drawings
The preferred embodiment of the algorithm proceeds in two main steps, with an
intermediate data structure, as shown in Fig. 1. The source image 21 is processed
by the detail computation 40, a bottom-up computation, to produce an image
pyramid 50. The intermediate image pyramid 50 is then processed top-down in the
quad-tree subdivision 60. The subdivision is controlled by the externally-supplied
detail threshold 22. The products of the subdivision are the block quad-tree 90, the
leaves of which are non-overlapping variably-sized blocks of pixels, and a
classification of those blocks into "sharp" blocks 91 whose detail is in excess of
the detail threshold 22, and "flat" blocks 92 whose detail does not exceed the
detail threshold 22.
The source image 21 is an intensity image, which is typically luma, but there is no
restriction in application of the invention to chroma or conventional red, blue, or
green channels. However, both video bandwidth and human perceptual sensitivity
are highest for luma, so luma is preferred as an input.
The detail threshold 22 is a scalar. The units of detail are pixel signal energy, and
as such they may be related to the square of the maximum intensity of the pixels
in the source image 21. In a typical case, with pixel values ranging over 0 - 255,
detail thresholds in the range 1000 - 3000 have been found to give satisfactory
block selections for use in downstream block matching.
The internal pyramid structure 50 will be examined in detail in advance of the detail
computation 40, since the detail computation 40 populates the pyramid 50.
An image pyramid will be familiar to those skilled in the prior art. In the simplest
usage, the image pyramid presents a collection of reduced resolution versions of a
source image, with each reduced resolution image derived from the image at the
next higher level of resolution. The pyramid construction proceeds bottom-up, with
the deepest (highest resolution) level of the pyramid in registration with a source
image.
In Fig. 2, an example of the preferred embodiment of the image pyramid 50 is
presented. The image pyramid 50 contains a scalar depth 501, which specifies the
number of layers 502, each of which is an image, 510, 511, 512, 513, and 514.
In Fig. 2, we have fixed a depth of 5 for purposes of illustration, but of course the
depth may take arbitrary positive integral values. Each image 510, etc., contains a
two-dimensional array, with each individual element in the array a pyramid datum
5101, 5111, 5121, 5131, and 5141. The pyramid datum 5101, etc., will be
examined in further detail in Fig. 3.
The structure of the image pyramid is tightly coupled to the structure of the source
image 21. The deepest layer of the pyramid 514 is in correspondence with the
source image 21 . For illustrative purposes in Fig. 2, the source image has been
assumed to comprise an array of 32 x 24 pixels, but, as will be described herein,
there is no restriction placed on the dimensions of the source image 21.
The pyramid 50 is constructed from the source image using a bottom-up process of
coalescence with optional augmentation. In the coalescence process, each
non-overlapping 2 x 2 datum window is associated with a single datum in the
succeeding layer. The coalesced elements are denoted children while the single
datum in the succeeding layer is denoted the parent. Children may be referenced
by offset from the parent using a compass notation {NE, NW, SW, SE}
corresponding to the quadrant of the parent occupied by each child. Any element
in the pyramid may be accessed at random in constant time by the use of three
indices h, i, and j, which specify the depth in the pyramid, and the row and column
in the layer, respectively. The three indices collectively will be denoted a "pyramid
index" 9113 (Fig. 7).
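The constant-time addressing described above may be sketched as follows. The type name PyramidIndex, the helper names, and in particular the mapping of compass directions to row and column offsets (row 0 at the north edge, column 0 at the west edge) are assumptions made for illustration; the sketch also assumes no augmentation row or column has been inserted ahead of the first row or column.

```python
from typing import NamedTuple


class PyramidIndex(NamedTuple):
    h: int  # depth in the pyramid (0 is the root layer)
    i: int  # row within the layer
    j: int  # column within the layer


# Assumed convention: (row offset, column offset) of each child quadrant.
CHILD_OFFSETS = {"NE": (0, 1), "NW": (0, 0), "SW": (1, 0), "SE": (1, 1)}


def child_index(parent, orientation):
    """Index of the child occupying the given quadrant of the parent."""
    di, dj = CHILD_OFFSETS[orientation]
    return PyramidIndex(parent.h + 1, 2 * parent.i + di, 2 * parent.j + dj)


def parent_index(child):
    """Index of the datum into which the child's 2 x 2 window coalesces."""
    return PyramidIndex(child.h - 1, child.i // 2, child.j // 2)


se = child_index(PyramidIndex(1, 2, 3), "SE")
print(se)                # PyramidIndex(h=2, i=5, j=7)
print(parent_index(se))  # PyramidIndex(h=1, i=2, j=3)
```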
Since coalescence requires a 2 x 2 datum window, an odd-sized image is
augmented by replication of first or last row or column, as necessary. This process
is illustrated in the construction of 511, the pyramid layer at depth 1 in Fig. 2, from
512, the pyramid layer at depth 2. In 512, we have 3 x 2 datum elements from the
coalescence of layer 3 (513). The 3 x 2 datum elements are augmented by
duplication of the top row of elements.
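Augmentation may be sketched as below. The description permits replication of either the first or the last row or column; this sketch arbitrarily replicates the last one, and the function name augment is an assumption introduced for illustration.

```python
def augment(layer):
    """Return a copy of the layer padded to even dimensions.

    The layer is a list of equal-length rows. An odd width is fixed by
    replicating the last column, an odd height by replicating the last row
    (an assumed choice; the first row or column would serve equally well).
    """
    rows = [list(row) for row in layer]
    if len(rows[0]) % 2:
        rows = [row + [row[-1]] for row in rows]
    if len(rows) % 2:
        rows.append(list(rows[-1]))
    return rows


print(augment([[1, 2, 3]]))  # [[1, 2, 3, 3], [1, 2, 3, 3]]
```

A layer whose dimensions are already even is returned unchanged, so every 2 x 2 datum window in the result is complete.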
For maximum efficiency in computation, the preferred embodiment of the pyramid
50 utilizes the source image 21 as its deepest layer, unless the source image 21
is odd-sized and requires augmentation. The detail and recursive detail of the
deepest layer, whether it be the source image 21 or an augmented copy thereof,
are defined to be zero for the purposes of the detail
computation. Note also that elements in the deepest layer of the pyramid have no
children.
In addition to image data providing a series of reduced-resolution versions of the
source image, it is often convenient to associate other data with each element of
the pyramid. In Fig. 3, a pyramid datum 5101 in the preferred embodiment of the
present invention incorporates detail 51012 and recursive detail 51013 in addition
to the usual signal level 51011 (image intensity).
To populate the pyramid 50 in the preferred embodiment of the present invention,
two kernels are utilized. The signal kernel 521 and the detail kernel 522 are
depicted in Fig. 4. Those skilled in the prior art will recognize that these kernels
correspond to the Lo-Lo and Hi-Hi kernels of the Haar transform, the oldest and
simplest of wavelet transforms. The signal kernel 521 is the simplest low-pass
filter; it is applied here to the construction of reduced resolution copies of the
source image. The detail kernel 522 is utilized to identify image detail which is likely
to match unambiguously. It represents the remaining signal after constant and
first-order (gradient) signal has been removed by the Lo-Lo, Hi-Lo, and Lo-Hi kernels
in the decomposition. Since the constant and first-order signals are prone to
ambiguous matching, the restriction of the detail measure to Hi-Hi is a major
contribution to this invention.
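The selectivity of the Hi-Hi kernel may be checked with a short sketch. Fig. 4 is not reproduced here, so the kernel coefficients below are assumptions: a 2 x 2 mean for the signal kernel and the orthonormal Haar Hi-Hi coefficients for the detail kernel. Other normalizations would scale the responses without changing which windows respond.

```python
# Assumed coefficients; the exact values of kernels 521 and 522 are in Fig. 4.
SIGNAL_KERNEL = ((0.25, 0.25), (0.25, 0.25))  # Lo-Lo: mean of the window
DETAIL_KERNEL = ((0.5, -0.5), (-0.5, 0.5))    # Hi-Hi: diagonal difference


def inner(window, kernel):
    """Inner product of a 2 x 2 window with a 2 x 2 kernel."""
    return sum(w * k
               for w_row, k_row in zip(window, kernel)
               for w, k in zip(w_row, k_row))


flat = [[7, 7], [7, 7]]    # constant intensity
ramp = [[1, 2], [2, 3]]    # constant intensity gradient
edge = [[0, 9], [0, 9]]    # straight vertical edge
corner = [[9, 0], [0, 0]]  # isolated corner feature

for name, window in (("flat", flat), ("ramp", ramp),
                     ("edge", edge), ("corner", corner)):
    print(name, inner(window, SIGNAL_KERNEL), inner(window, DETAIL_KERNEL))
```

Under these assumptions the constant, gradient, and straight-edge windows all give a Hi-Hi response of zero, while the corner window gives 4.5, which is the behavior the paragraph above relies upon.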
The bottom-up computation of detail 40 using the image pyramid 50 is depicted in
Fig. 5. This computation proceeds from the lowest level of the pyramid to the
pyramid's root. At 401, the local variable level is initialized to the pyramid depth.
At 402, the main loop is controlled by the non-zero property of the level. Inside the
loop at 4021, the level is decremented. This ensures that the level in the loop will
always lie between 0 and depth-1 inclusive. At 4022, each datum in the current
pyramid level is considered. The signal level (e.g., 51011) and detail (e.g., 51012)
for the current datum are computed in 40221 and 40223 by inner product of the
signal level of the children with the signal 521 and detail 522 kernels, respectively.
The recursive detail (e.g., 51013) is initialized with the detail 51012, then the
recursive detail of each child of the datum (if any) is added to the current datum's
recursive detail in 402241.
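The bottom-up pass may be sketched as follows. This is an illustrative sketch rather than the algorithm of the figure: the names Datum and build_pyramid are hypothetical, the kernel coefficients are assumed (2 x 2 mean and orthonormal Haar Hi-Hi), every layer is assumed to have even dimensions so that no augmentation is needed, and the detail is taken as the square of the Hi-Hi inner product on the assumption that detail is measured as signal energy.

```python
from dataclasses import dataclass


@dataclass
class Datum:
    signal: float = 0.0
    detail: float = 0.0
    recursive_detail: float = 0.0


def build_pyramid(image, depth):
    """Return layers[0..depth-1]; layers[depth-1] mirrors the source image."""
    layers = [None] * depth
    level = depth                                   # 401
    while level:                                    # 402
        level -= 1                                  # 4021
        if level == depth - 1:
            # Deepest layer: the pixels, with zero detail and no children.
            layers[level] = [[Datum(signal=float(p)) for p in row]
                             for row in image]
            continue
        below = layers[level + 1]
        layer = []
        for i in range(len(below) // 2):            # 4022: each datum
            row = []
            for j in range(len(below[0]) // 2):
                kids = [below[2 * i + di][2 * j + dj]
                        for di in (0, 1) for dj in (0, 1)]
                nw, ne, sw, se = (k.signal for k in kids)
                signal = (nw + ne + sw + se) / 4.0  # assumed Lo-Lo kernel
                hi_hi = (nw - ne - sw + se) / 2.0   # assumed Hi-Hi kernel
                detail = hi_hi ** 2                 # assumed: energy
                recursive = detail + sum(k.recursive_detail for k in kids)
                row.append(Datum(signal, detail, recursive))
            layer.append(row)
        layers[level] = layer
    return layers


image = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 8, 0],
         [0, 0, 0, 0]]
layers = build_pyramid(image, depth=3)
root = layers[0][0][0]
print(root.signal, root.detail, root.recursive_detail)  # 0.5 1.0 17.0
```

In this example the single bright pixel contributes a detail of 16.0 to its parent at depth 1, and the root accumulates that value on top of its own detail of 1.0.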
At the conclusion of the detail computation 40, the pyramid will contain a measure
of signal, detail and recursive detail for each datum from the base (5141, etc.) to
the root 5101. Each pyramid datum corresponds to a window of pixels (a candidate
block) as well as a geometric region in the image. The detail computation proceeded
from the bottom-up, working from the pixels in the source image 21 up to the root
of the pyramid 50, layer by layer. The algorithm now proceeds from top down,
beginning from the root 5101.
The crucial supporting process in the selection of blocks is subdivision of a block
or window into four equally sized, non-overlapping children, occupying the same
area as the original block. The subdivision of a block is illustrated in Fig. 6. The
parent block 9100, corresponding to a pyramid datum at level h, row i, column j,
with geometric bounding box (u, v, u + delta, v + delta), is subdivided into
four children, 9101, 9102, 9103, with pyramid indices and geometry as shown.
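The geometry of a subdivision may be sketched as follows. Since Fig. 6 is not reproduced here, the axis convention is an assumption: u is taken to increase eastward with the column index and v southward with the row index, and delta is assumed to be even. The function name subdivide_block is hypothetical.

```python
def subdivide_block(index, box):
    """Split a parent block into four equal, non-overlapping children.

    index is the parent's pyramid index (h, i, j); box is its bounding box
    (u, v, u + delta, v + delta). Returns a mapping from orientation to
    (child pyramid index, child bounding box).
    """
    h, i, j = index
    u, v, u_max, _ = box
    half = (u_max - u) // 2
    offsets = {"NE": (0, 1), "NW": (0, 0), "SW": (1, 0), "SE": (1, 1)}
    children = {}
    for orientation, (di, dj) in offsets.items():
        cu, cv = u + dj * half, v + di * half
        children[orientation] = ((h + 1, 2 * i + di, 2 * j + dj),
                                 (cu, cv, cu + half, cv + half))
    return children


kids = subdivide_block((0, 0, 0), (0, 0, 8, 8))
print(kids["NE"])  # ((1, 0, 1), (4, 0, 8, 4))
print(kids["SE"])  # ((1, 1, 1), (4, 4, 8, 8))
```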
Initially only the root is available for subdivision. When a parent is subdivided, its
children become candidates for subdivision. The quad-tree is a convenient data
structure for the representation and management of this process. The quad-tree will
be familiar to those skilled in the prior art; it provides at a minimum a link from a
parent quad to its children and typically a link back from child to parent.
In the preferred embodiment of the present invention, the quad-tree datum is also
provided with a pyramid index to refer to detail and provide geometry information.
The quad-tree datum 911 is illustrated in Fig. 7. The datum 911 provides a link
9111 to its parent, which is 0 in case the datum 911 is the root of the quad-tree.
Also, the orientation 9112 of the child amongst the parent's children is retained in
the datum. The orientation 9112 takes one of the values NE, NW, SW, SE, except
in the case of the root, where the orientation is undefined. The quad-tree datum
911 contains a pyramid index 9113 which identifies the associated pyramid datum,
and hence provides a source for detail information as well as geometric information.
The pyramid index 9113 in turn contains individual indices for depth (h) 91131, row
(i) 91132, and column (j) 91133. Finally, the quad-tree datum contains links to its
children 9114, if any. There are four children, 91141, 91142, 91143, and 91144,
corresponding to the orientations NE, NW, SW, SE. If the quad-tree datum is a leaf,
the child links will be 0. Otherwise, each child link refers to a distinct quad-tree
datum. As an illustration of the quad-tree linkages, a quad-tree data structure after
subdivision of an arbitrary quad-tree datum associated with pyramid index h, i, j
is shown in Fig. 8. The parent node 9120 has been subdivided to provide children
9121, 9122, 9123, and 9124. The parent and orientation of the parent node 9120
are not shown as they refer to elements outside of the figure.
With the quad-tree specified, the detailed subdivision algorithm 60 is
presented in Fig. 9. This algorithm will provide the block selections quad-tree 90,
and the collections of sharp and flat blocks, 91 and 92, respectively. The algorithm
operates on the detail pyramid 50, with an externally supplied detail threshold 22
to control the subdivision process. The algorithm makes use of an internal stack of
quad-tree nodes, which is a last-in, first-out collection which will be familiar to
those skilled in the prior art. The stack provides push and pop operations to insert
and remove elements. An alternative embodiment could make use of a recursive
algorithm to obviate the direct use of the stack, possibly with a slight loss of
efficiency. The algorithm also assumes a constructor for quad-tree nodes, indicated
by new_quad_tree_node, which requires as arguments the parent quad-tree node and
the pyramid index which is to be associated with the new node.
Initially, the sharp_nodes and flat_nodes collections are empty (601, 602). The
quad-tree is initialized to a single node, corresponding to the root of the pyramid
(603, 604). The detail associated with the root is then consulted in 605. If the root
node's detail is in excess of the detail threshold 22, as will typically, but not always,
be the case, the root is pushed onto the stack (6051). Otherwise (606), the root is
added to the flat_nodes collection (6061).
The main loop of the algorithm 60 (607) is based on the presence of sharp node
candidates on the stack. No node is placed on the stack unless the detail associated
with its pyramid datum exceeds the threshold. Hence, a node on the stack is either
sharp, or will be subdivided to yield one or more descendent sharp nodes. Thus,
while there are candidates on the stack (607), the algorithm takes a candidate node
(6071). Initially, the algorithm assumes the node will not be subdivided (6072). The
children of the candidate node are examined in a loop at 6073. The detail
associated with each child is compared to the detail threshold 22 (60731). If a child
is found with detail in excess of the threshold (607311), the subdivision flag is
raised (607311) and the scan of the children is aborted (607312).
If the subdivision flag was raised (6074), the children are scanned again (60741),
and a new quad-tree node is created for each child (607411). The detail
associated with the child is compared to the detail threshold 22 (607412). If
detail in excess of the threshold is found, the child is placed on the stack as a
subdivision candidate (6074121). Otherwise (607413), the child is added to the
collection of flat nodes (6074131).
If the subdivision flag was not raised (6075), the node is added to the collection of
sharp nodes (60751).
The algorithm 60 continues until there are no subdivision candidates remaining on
the stack.
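The stack-driven subdivision described above may be sketched as follows. Several points are assumptions made for illustration: the names QuadNode and subdivide are hypothetical, the detail consulted for each node is supplied as a nested list detail[h][i][j] (taken here to hold recursive detail), layer dimensions double at each depth so that no augmentation is involved, and the compass-to-offset mapping places row 0 at the north edge and column 0 at the west edge. Step numbers from the description appear as comments.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QuadNode:
    parent: Optional["QuadNode"]
    index: tuple           # pyramid index (h, i, j)
    orientation: str = ""  # empty for the root
    children: dict = field(default_factory=dict)


OFFSETS = {"NE": (0, 1), "NW": (0, 0), "SW": (1, 0), "SE": (1, 1)}


def subdivide(detail, threshold):
    """Return (root, sharp_nodes, flat_nodes) for a detail pyramid."""
    sharp_nodes, flat_nodes = [], []                    # 601, 602
    root = QuadNode(None, (0, 0, 0))                    # 603, 604
    stack = []
    if detail[0][0][0] > threshold:                     # 605
        stack.append(root)                              # 6051
    else:                                               # 606
        flat_nodes.append(root)                         # 6061
    while stack:                                        # 607
        node = stack.pop()                              # 6071
        h, i, j = node.index
        kids = {}
        if h + 1 < len(detail):  # the deepest layer has no children
            kids = {o: (h + 1, 2 * i + di, 2 * j + dj)
                    for o, (di, dj) in OFFSETS.items()}
        # 6072-6073: subdivide only if some child exceeds the threshold.
        split = any(detail[a][b][c] > threshold for a, b, c in kids.values())
        if split:                                       # 6074
            for orientation, idx in kids.items():       # 60741
                child = QuadNode(node, idx, orientation)  # 607411
                node.children[orientation] = child
                a, b, c = idx
                if detail[a][b][c] > threshold:         # 607412
                    stack.append(child)                 # 6074121
                else:                                   # 607413
                    flat_nodes.append(child)            # 6074131
        else:                                           # 6075
            sharp_nodes.append(node)                    # 60751
    return root, sharp_nodes, flat_nodes


# Hypothetical three-layer detail pyramid with detail in one quadrant only.
detail = [[[17.0]],
          [[0.0, 0.0], [0.0, 16.0]],
          [[0.0] * 4 for _ in range(4)]]
root, sharp, flat = subdivide(detail, threshold=10.0)
print([n.index for n in sharp])         # [(1, 1, 1)]
print(sorted(n.index for n in flat))    # [(1, 0, 0), (1, 0, 1), (1, 1, 0)]
```

With the example data, the root is subdivided once, the south-east child is retained as the only sharp block, and the other three children are classified as flat; together the four leaves tile the whole image.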
Fig. 10 is a demonstration of the algorithm on the famous "Lena" image, here in
512x512 luma. The detail threshold used here was 10,000, which is larger than
usual (1000-3000), but makes for a better illustration. The tiling shown illustrates
the selected blocks for matching; sharp nodes are indicated with an x, flat nodes
are left empty.
Having described this invention with regard to specific embodiments, it is to be understood that the description is not meant as a limitation since further variations or modifications may be apparent or may suggest themselves to those skilled in the art. It is intended that the present application cover such variations and modifications as fall within the scope of the appended claims.
In addition to the disclosure of the inventions provided herein, several additional
references may be of interest to those of ordinary skill and useful for additional
background and information of relevance. These references include:
1. S. S. Beauchemin and J. L. Barron, "The Computation of Optic Flow", ACM
Computing Surveys, Vol. 27, No. 8 (Sept. 1995), pp. 433-467.
2. P. J. Burt, The Pyramid as a Structure for Efficient Computation, in A.
Rosenfeld, Editor, Multiresolution Image Processing and Analysis,
Springer-Verlag, Berlin, 1984, pp. 6-35.
3. E. J. Stollnitz, T. D. DeRose, and D. H. Salesin, Wavelets for Computer
Graphics, Morgan Kaufmann, San Francisco, 1996.