WO1999034331A1 - Apparatus and method for performing scalable hierarchical motion estimation - Google Patents

Apparatus and method for performing scalable hierarchical motion estimation

Info

Publication number
WO1999034331A1
Authority
WO
WIPO (PCT)
Prior art keywords
pyramid
ary
image
ary pyramid
generating
Prior art date
Application number
PCT/US1998/027543
Other languages
French (fr)
Inventor
Song Xudong
Tihao Chiang
Ya-Qin Zhang
Ravi Krishnamurthy
Original Assignee
Sarnoff Corporation
Priority date
Filing date
Publication date
Priority claimed from US09/002,258 external-priority patent/US6408101B1/en
Application filed by Sarnoff Corporation filed Critical Sarnoff Corporation
Priority to JP2000526903A priority Critical patent/JP2002500402A/en
Priority to KR1020007007344A priority patent/KR20010033797A/en
Priority to EP98964302A priority patent/EP1042734A1/en
Priority to AU19470/99A priority patent/AU1947099A/en
Publication of WO1999034331A1 publication Critical patent/WO1999034331A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/53Multi-resolution motion estimation; Hierarchical motion estimation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/56Motion estimation with initialisation of the vector search, e.g. estimating a good candidate to initiate a search
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/63Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using sub-band based transform, e.g. wavelets

Definitions

  • the invention relates generally to a system for encoding image sequences and, more particularly, to an apparatus and a concomitant method for performing hierarchical block-based motion estimation with a high degree of scalability.
  • An image sequence, such as a video image sequence, typically includes a sequence of image frames or pictures.
  • The reproduction of video containing moving objects typically requires a frame rate of thirty image frames per second, with each frame possibly containing in excess of a megabyte of information. Consequently, transmitting or storing such image sequences requires a large amount of either transmission bandwidth or storage capacity.
  • the frame sequence is compressed such that redundant information within the sequence is not stored or transmitted.
  • Television, video conferencing and CD-ROM archiving are examples of applications which can benefit from efficient video sequence encoding.
  • motion estimation is a process of determining the direction and magnitude of motion (motion vectors) for an area (e.g., a block or macroblock) in the current frame relative to one or more reference frames.
  • motion compensation is a process of using the motion vectors to generate a prediction (predicted image) of the current frame. The difference between the current frame and the predicted frame results in a residual signal (error signal), which contains substantially less information than the current frame itself.
  • Encoder designers must address the trade-off between increasing the precision of the motion estimation process to minimize the residual signal (i.e., reducing coding bits) and accepting a lower level of precision in the motion estimation process to minimize the computational overhead. Namely, determining the motion vectors from the frame sequence requires intensive searching between frames to determine the motion information. A more intensive search will generate a more precise set of motion vectors at the expense of more computational cycles.
  • some systems determine motion information using a so-called block based approach.
  • the current frame is divided into a number of blocks of pixels (referred to hereinafter as the "current blocks").
  • a search is performed within a selected search area in the preceding frame for a block of pixels that "best" matches the current block.
  • This search is typically accomplished by repetitively comparing a selected current block to similarly sized blocks of pixels in the selected search area of the preceding frame.
  • the determination of motion vectors by this exhaustive search approach is computationally intensive, especially where the search area is particularly large.
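The exhaustive block-matching search described above can be sketched as follows. This is an illustrative toy example, not the patent's implementation; the frame size, block size, search range, and the sum-of-absolute-differences (SAD) matching criterion are assumptions chosen for clarity.

```python
# Illustrative sketch of exhaustive (full-search) block matching: for one
# current block, every candidate displacement inside a search window of the
# reference frame is scored with the sum of absolute differences (SAD).

def sad(cur, ref, bx, by, dx, dy, bs):
    """SAD between the current block at (bx, by) and the reference block
    displaced by (dx, dy), both of size bs x bs."""
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

def full_search(cur, ref, bx, by, bs=4, rng=2):
    """Return the motion vector (dx, dy) minimizing SAD inside +/- rng."""
    best, best_cost = (0, 0), float("inf")
    h, w = len(ref), len(ref[0])
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= by + dy and by + dy + bs <= h and
                    0 <= bx + dx and bx + dx + bs <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, bs)
            if cost < best_cost:
                best_cost, best = cost, (dx, dy)
    return best

# Toy frames: a bright 4 x 4 patch sits at (4, 4) in the reference and has
# moved one pixel right in the current frame, so the best vector is (-1, 0).
ref = [[0] * 12 for _ in range(12)]
for y in range(4, 8):
    for x in range(4, 8):
        ref[y][x] = 200
cur = [[0] * 12 for _ in range(12)]
for y in range(4, 8):
    for x in range(5, 9):
        cur[y][x] = 200
print(full_search(cur, ref, 5, 4))   # (-1, 0)
```

For a search range of +/- R, every one of the (2R+1)^2 candidate displacements is evaluated for every block, which is the computational burden the hierarchical approach below is designed to reduce.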
  • To reduce this computational burden, hierarchical motion estimation (HME) can be employed.
  • an image is decomposed into a multiresolution framework, i.e., a pyramid.
  • a hierarchical motion vector search is then performed, where the search proceeds from the lowest resolution to the highest resolution of the pyramid.
  • HME has been demonstrated to be a fast and effective motion estimation method
  • the generation of the pyramid still incurs a significant amount of computational cycles.
  • Furthermore, the above motion estimation methods are not easily scalable. Namely, the architectures of these motion estimation methods do not provide a user or an encoder with the flexibility to scale or switch to a different architecture to account for available computational resources and/or the user's choices.
  • An embodiment of the present invention is an apparatus and method for performing hierarchical block-based motion estimation with a high degree of scalability.
  • the present scalable hierarchical motion estimation architecture provides the flexibility of switching from one-bit/pixel to eight-bit/pixel representation according to the available platform resources and/or user's choice. More specifically, the present invention decomposes each of the image frames within an image sequence into an M-ary pyramid, e.g., a four level binary pyramid. Different dynamic ranges for representing the pixel values are used for different levels of the binary pyramid, thereby generating a plurality of different "P-bit" levels.
  • FIG. 1 illustrates a block diagram of the encoder of the present invention
  • FIG. 2 illustrates a flowchart of a method for reducing the computational complexity in determining motion vectors for block-based motion estimation
  • FIG. 3 illustrates a block diagram of a general mean pyramid
  • FIG. 4 illustrates a block diagram of the quantization process that generates an M-ary pyramid
  • FIG. 5 illustrates an input frame which has been divided and classified into a plurality of blocks
  • FIG. 6 illustrates an encoding system of the present invention
  • FIG. 7 illustrates a block diagram of a block of pixels with multi-scale tiling
  • FIG. 8 illustrates a block diagram of a second embodiment of the apparatus of the present invention.
  • FIG. 9 illustrates a graphical representation of a wavelet tree
  • FIG. 10 illustrates a flowchart of a method for generating an M-ary pyramid for an image
  • FIG. 11 illustrates a flowchart of a method for performing scalable motion estimation on an M-ary pyramid
  • FIG. 12 illustrates a block diagram of a plurality of different M-ary pyramid architectures.
  • FIG. 1 depicts a block diagram of the apparatus 100 of the present invention for reducing the computational complexity in determining motion vectors for block-based motion estimation.
  • the preferred embodiment of the present invention is described below using an encoder, but it should be understood that the present invention can be employed in image processing systems in general.
  • The present invention can be employed in encoders that are compliant with various coding standards. These standards include, but are not limited to, the Moving Picture Experts Group standards (e.g., MPEG-1 (11172-*) and MPEG-2 (13818-*)), H.261 and H.263.
  • the apparatus 100 is an encoder or a portion of a more complex block-based motion compensated coding system.
  • the apparatus 100 comprises a motion estimation module 140, a motion compensation module 150, an optional segmentation module 151, a preprocessing module 120, a rate control module 130, a transform module, (e.g., a DCT module) 160, a quantization module 170, a coder, (e.g., a variable length coding module) 180, a buffer 190, an inverse quantization module 175, an inverse transform module (e.g., an inverse DCT module) 165, a subtractor 115 and a summer 155.
  • FIG. 1 illustrates an input image (image sequence) on path 110 which is digitized and represented as a luminance and two color difference signals (Y, Cr, Cb) in accordance with the MPEG standards. These signals are further divided into a plurality of layers such that each picture (frame) is represented by a plurality of macroblocks.
  • Each macroblock comprises four (4) luminance blocks, one Cr block and one Cb block, where a block is defined as an eight (8) by eight (8) sample array.
  • the division of a picture into block units improves the ability to discern changes between two successive pictures and improves image compression through the elimination of low amplitude transformed coefficients (discussed below).
  • the term macroblock or block in the present invention is intended to describe a block of pixels of any size or shape that is used for the basis of encoding. Broadly speaking, a "macroblock" could be as small as a single pixel, or as large as an entire video frame.
  • the digitized input image signal undergoes one or more preprocessing steps in the preprocessing module 120. More specifically, preprocessing module 120 comprises an M-ary pyramid generator 122 and a block classifier 124.
  • the M-ary pyramid generator 122 employs a mean filter 123a and a quantizer 123b to filter and to quantize each frame into a plurality of different resolutions, i.e., an M-ary pyramid of resolutions, where the different resolutions of each frame are correlated in a hierarchical fashion as described below.
  • the block classifier 124 is able to quickly classify areas (blocks) as areas of high activity or low activity. A detailed description is provided below for the functions performed by the preprocessing module 120.
  • the input image on path 110 is also received into motion estimation module 140 for estimating motion vectors.
  • a motion vector is a two-dimensional vector which is used by motion compensation to provide an offset from the coordinate position of a block in the current picture to the coordinates in a reference frame.
  • the use of motion vectors greatly enhances image compression by reducing the amount of information that is transmitted on a channel because only the changes within the current frame are coded and transmitted.
  • the motion estimation module 140 also receives information from the preprocessing module 120 to enhance the performance of the motion estimation process.
  • the motion vectors from the motion estimation module 140 are received by the motion compensation module 150 for improving the efficiency of the prediction of sample values.
  • Motion compensation involves a prediction that uses motion vectors to provide offsets into the past and/or future reference frames containing previously decoded sample values, and is used to form the prediction error. Namely, the motion compensation module 150 uses the previously decoded frame and the motion vectors to construct an estimate (motion compensated prediction or predicted image) of the current frame on path 152. This motion compensated prediction is subtracted via subtractor 115 from the input image on path 110 in the current macroblocks to form an error signal (e) or predictive residual on path 153.
  • the predictive residual signal is passed to a transform module, e.g., a DCT module 160.
  • the DCT module then applies a forward discrete cosine transform process to each block of the predictive residual signal to produce a set of eight (8) by eight (8) block of DCT coefficients.
  • the discrete cosine transform is an invertible, discrete orthogonal transformation where the DCT coefficients represent the amplitudes of a set of cosine basis functions.
  • the resulting 8 x 8 block of DCT coefficients is received by quantization (Q) module 170, where the DCT coefficients are quantized.
  • the process of quantization reduces the accuracy with which the DCT coefficients are represented by dividing the DCT coefficients by a set of quantization values or scales with appropriate rounding to form integer values.
  • the quantization values can be set individually for each DCT coefficient, using criteria based on the visibility of the basis functions (known as visually weighted quantization). By quantizing the DCT coefficients with this value, many of the DCT coefficients are converted to zeros, thereby improving image compression efficiency.
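A minimal sketch of the quantization step follows, assuming a single uniform quantization scale rather than the visually weighted matrices mentioned above; the scale value 16 and the coefficient block are made-up examples.

```python
# Sketch of uniform quantization of DCT coefficients: dividing each
# coefficient by a scale with rounding drives small coefficients to zero,
# which is what improves compression efficiency.

def quantize(coeffs, scale):
    """Quantize a 2-D block of coefficients with one uniform scale."""
    return [[round(c / scale) for c in row] for row in coeffs]

# Made-up 4 x 4 block: one large DC term and decaying AC terms.
block = [[312, -45, 8, 3],
         [-60, 20, -4, 1],
         [9, -5, 2, 0],
         [2, 1, 0, 0]]

q = quantize(block, 16)
print(q)   # most high-frequency coefficients become zero
```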
  • A coder, e.g., variable length coding module 180, receives the resulting 8 x 8 block of quantized DCT coefficients from the quantization module 170 via signal connection 171, where the two-dimensional block of quantized coefficients is scanned in a "zig-zag" order to convert it into a one-dimensional string of quantized DCT coefficients.
  • Variable length coding (VLC) module 180 then encodes the string of quantized DCT coefficients and all side-information for the macroblock such as macroblock type and motion vectors. Thus, the VLC module 180 performs the final step of converting the input image into a valid data stream.
  • the data stream is received into a buffer, e.g., a "First In-First Out” (FIFO) buffer 190.
  • a consequence of using different picture types and variable length coding is that the overall bit rate is variable. Namely, the number of bits used to code each frame can be different.
  • a FIFO buffer is used to match the encoder output to the channel for smoothing the bit rate.
  • the output signal on path 195 from FIFO buffer 190 is a compressed representation of the input image 110, where it is sent to a storage medium or a telecommunication channel.
  • the rate control module 130 serves to monitor and adjust the bit rate of the data stream entering the FIFO buffer 190 to prevent overflow and underflow on the decoder side (within a receiver or target storage device, not shown) after transmission of the data stream.
  • a fixed-rate channel is assumed to carry bits at a constant rate to an input buffer within the decoder (not shown).
  • the decoder instantaneously removes all the bits for the next picture from its input buffer. If there are too few bits in the input buffer, i.e., all the bits for the next picture have not been received, then the input buffer underflows resulting in an error.
  • a rate control method may control the number of coding bits by adjusting the quantization scales.
  • the encoder regenerates I-frames and P-frames of the image sequence by decoding the data so that they are used as reference frames for subsequent encoding.
  • FIG. 2 illustrates a flowchart of a method 200 for reducing the computational complexity in determining motion vectors for block-based motion estimation. Namely, the method 200 enhances a block-based motion estimation method by quickly defining an initial search area where a match will likely occur.
  • method 200 starts in step 205 and proceeds to step 210 where an M-ary pyramid (or M-ary mean pyramid) is generated for each image frame in the image sequence.
  • FIG. 10 illustrates a flowchart of a method 1000 for generating an M-ary pyramid for an image.
  • the method starts in step 1005 and proceeds to step 1010 where the original image is decomposed into a mean pyramid of images as illustrated in FIG. 3.
  • FIG. 3 illustrates a block diagram of a general mean pyramid 300, where the mean pyramid comprises a plurality of levels 310, 320 and 330.
  • the lowest level 310 is an original image frame from the image sequence having a plurality of pixels 311 represented by "x"s.
  • these pixels are represented by pixel values having a dynamic range that is limited by the number of bits allocated to represent the pixel values. For example, if eight (8) bits are allocated, then a pixel value may take a value from one of 256 possible values.
  • a next higher level is generated by lowpass filtering and downsampling by a factor of two in both directions, thereby generating a single pixel value (parent) for a higher level from four (4) pixel values (children) in a lower level.
  • This is illustrated in FIG. 3, where each set of four pixels 312a-d is used to generate a single pixel value 321 in level 320.
  • the set of four pixel values 322a is used to generate a single pixel value 331 in level 330 and so on.
  • the present invention is not limited to a mean pyramid having three levels. The number of levels is generally limited by the size of the image and the downsampling factor selected to generate the next lower resolution image. Thus, the number of levels in the mean pyramid can be selected for a particular application.
  • In a mean pyramid, the parent pixel value is derived by taking the average of its four children pixel values, thus the term mean pyramid.
  • Alternatively, the measure can be based on the median of the four children pixel values.
  • a larger area around the children pixels can be used for a weighted average to obtain a general lowpass pyramid.
  • From each of these different types of pyramids having a set of pyramidal images, e.g., a mean pyramid, a median pyramid, a lowpass pyramid and the like, the M-ary pyramid is generated.
  • Method 1000 then generates an M-ary pyramid from said mean pyramid in step 1020.
  • In an M-ary pyramid, the pixel values are quantized such that each quantized pixel value can only take "M" possible pixel values as illustrated in FIG. 4 below. For example, if M equals two (2), then each quantized pixel value can take on a value of 0 or 1, i.e., resulting in a "binary pyramid".
  • different dynamic ranges for representing the pixel values are used for different levels of the binary pyramid, thereby generating a plurality of different "P-bit" levels or layers.
  • A level of the M-ary pyramid having eight bits/pixel is referred to as an 8-bit level (illustrated as an "E" level in FIG. 12), whereas a level of the M-ary pyramid having one bit/pixel (e.g., Boolean) is referred to as a 1-bit level (illustrated as an "O" level in FIG. 12).
  • The mean pyramid 300 as discussed above comprises a plurality of "E" levels. The distinction and combinatorial use of these "E" and "O" levels are discussed further below.
  • FIG. 4 illustrates a block diagram of the quantization process that generates a ternary pyramid, where M equals three (3). More specifically, an eight-bit pixel value 255 (410a) is quantized into a two-bit pixel value 10 (420a) based on the difference between the child and parent pixels. Namely, a difference is computed between a parent 430a and each of its children 410a-d, where each of the four (4) differences is then quantized into three possible values 10, 00, and 01. Thus, pixel value 128 (410b and 410c) is quantized into a pixel value 00 (420b and 420c) and pixel value 0 (410d) is quantized into a pixel value 01 (420d).
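The FIG. 4 process can be sketched as follows. The threshold T and the exact mapping of differences to the codes 10, 00, and 01 are assumptions for illustration (a positive difference beyond T maps to 10, a negative one to 01, anything in between to 00), but the toy values reproduce the example above.

```python
# Sketch of ternary (M = 3) quantization: the difference between each child
# pixel and its parent mean is mapped to one of three 2-bit codes. The
# threshold T = 5 and the code assignment are illustrative assumptions.

def ternary_code(child, parent, T=5):
    """Quantize one child-parent difference into a 2-bit ternary code."""
    d = child - parent
    if d > T:
        return "10"   # significantly brighter than the parent
    if d < -T:
        return "01"   # significantly darker than the parent
    return "00"       # within the dead zone around zero

# FIG. 4's example values: children 255, 128, 128, 0 under a parent mean 128.
parent = 128
children = [255, 128, 128, 0]
print([ternary_code(c, parent) for c in children])   # ['10', '00', '00', '01']
```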
  • the M-ary pyramid reduces accuracy of the pixel values, thereby allowing rapid detection of "features" within an image.
  • Features are defined as areas of high activities or intensity, e.g., the edges of an object.
  • the significant reduction in the number of bits used to represent the pixel values translates into a reduction in computational overhead in the motion estimation process.
  • the block matching operation performed in the motion estimation process can be accelerated since there are fewer possible values that a pixel value can take on, thereby simplifying the overall block matching process.
  • Although M can be any value, there is a trade-off between a "lower order" M-ary pyramid, e.g., a binary pyramid decomposition, and a "higher order" M-ary pyramid, e.g., a ternary pyramid.
  • Since the quantized pixel values in a binary pyramid can only take one of two possible values, noise may introduce errors, where a pixel value can be erroneously interpreted as having a value 1 instead of 0 or vice versa.
  • a "higher order" M-ary pyramid requires more computational overhead.
  • method 1000 ends in step 1030 and returns to step 220 of FIG. 2.
  • The important aspect in step 210 is the generation of an M-ary pyramid for each of the input images in the image sequence.
  • Aside from the M-ary mean pyramid, other types of M-ary pyramids can be employed in the present invention, e.g., an M-ary median pyramid, an M-ary lowpass pyramid and so on.
  • The inventive concept of an M-ary mean pyramid decomposition can be expressed in equation form. Let (i, j) represent the pixel locations on an image frame and let I(i, j) represent the intensity at location (i, j). Further, let l indicate the level within a pyramid, with 0 ≤ l ≤ L, where L is the highest level in the pyramid. Then, the mean pyramids X^l(i, j), 1 ≤ l ≤ L, are constructed as follows:

    X^l(i, j) = (1/4) Σ_(m=0..1) Σ_(n=0..1) X^(l-1)(2i + m, 2j + n), with X^0(i, j) = I(i, j).
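The mean-pyramid recursion just described can be sketched as follows; the toy image sizes and the nested-list image representation are assumptions for illustration.

```python
# Sketch of the mean-pyramid recursion: each parent pixel at level l is the
# average of its four children at level l-1 (downsampling by two in each
# direction). Real frames would be far larger than this toy 4 x 4 image.

def next_level(img):
    """Produce level l from level l-1 by averaging 2 x 2 child blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*i][2*j] + img[2*i][2*j+1] +
              img[2*i+1][2*j] + img[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def mean_pyramid(image, levels):
    """Return [X^0, X^1, ..., X^levels] (level 0 is the original image)."""
    pyr = [image]
    for _ in range(levels):
        pyr.append(next_level(pyr[-1]))
    return pyr

x0 = [[10, 30, 0, 0],
      [50, 70, 0, 0],
      [0, 0, 80, 80],
      [0, 0, 80, 80]]
pyr = mean_pyramid(x0, 2)
print(pyr[1])   # [[40.0, 0.0], [0.0, 80.0]]
print(pyr[2])   # [[30.0]]
```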
  • The block is an 8 x 8 subblock of a macroblock, but it should be understood that the present invention is not limited to this block size.
  • Features like edges can be extracted from the variation of intensities within a block. This variation is represented by calculating the difference between the mean value at a level l, 0 ≤ l ≤ L-1, and the mean value at level l+1.
  • these differences are quantized to produce the M-ary pyramid.
  • Each level of the M-ary pyramid will illustrate a pattern over the image that can be used to identify image features like edges and zero-crossings or for implementing motion estimation.
  • From these differences, a binary pyramid B^l(i, j) of images can be built, e.g., by quantizing the child-parent difference against a threshold T (equation (1a)):

    B^l(i, j) = 1 if |X^l(i, j) - X^(l+1)(⌊i/2⌋, ⌊j/2⌋)| > T, and B^l(i, j) = 0 otherwise.
  • While equation (1a) illustrates a particular condition (quantizer step) that defines the two values ("0" and "1") of the binary pyramid, other conditions or quantizer steps can be used to define the two values ("0" and "1") of the binary pyramid in accordance with a particular application.
  • This definition has the advantage of noise-robustness if the quantization threshold T (e.g., in the preferred embodiment T is selected to be 5) is suitably chosen for a particular application. Namely, it is possible to define a "dead zone", e.g., |d| < T where d is the difference between a child mean value and its parent mean value, so that slight variations in the pixel values due to noise can be removed effectively. Thus, any M-ary pyramids (M>2) having a dead zone around zero will minimize the noise sensitivity problem. In relatively flat areas (areas of low activity), Y^l(i, j) will contain a large number of zeros (0), while in regions containing edges, Y^l(i, j) will contain a number of ones (1).
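A hedged sketch of the binary quantization with a dead zone follows: a pixel of the binary pyramid is set to 1 only when the child mean differs from its parent mean by more than T (T = 5 per the preferred embodiment above); the exact quantizer condition of equation (1a) may differ in form.

```python
# Sketch of binary (M = 2) quantization with a dead zone: small child-parent
# differences (noise in flat areas) map to 0, large differences (edges) to 1.

def binary_level(child, parent, T=5):
    """Binary pyramid level from a child image and its parent (half-size) image."""
    h, w = len(child), len(child[0])
    return [[1 if abs(child[i][j] - parent[i // 2][j // 2]) > T else 0
             for j in range(w)] for i in range(h)]

# A flat region (left 2 x 2) yields zeros; an edge region (right) yields ones.
child = [[100, 101, 0, 255],
         [99, 100, 255, 0]]
parent = [[100, 128]]
print(binary_level(child, parent))   # [[0, 0, 1, 1], [0, 0, 1, 1]]
```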
  • the blocks in the input image can be classified for the purpose of feature extraction using the M-ary pyramid, Y(i, j).
  • the M-ary pyramid can be used to rapidly detect features in the input image without incurring a high computational overhead.
  • the detected features can be used to enhance the motion estimation process as discussed below or other image processing steps, e.g., segmentation of areas (such as objects) within an image, e.g., by using segmentation module 151.
  • Segmentation is an important image processing step, where important areas in the image can be identified to receive special treatment. For example, the face of a person during a video conferencing application may demand special image processing such as receiving a greater allocation of coding bits. Additionally, segmentation can be employed to identify large objects where global motion estimation can be performed on these large objects.
  • The ternary pyramid of FIG. 4 shows one possible method in which the quantization thresholds or levels can be assigned for feature identification and classification.
  • M-ary pyramids with M > 2 can be used with the specific assignment of the quantization thresholds being dependent on the requirement of a particular application and/or the content of the image sequence.
  • In step 220, the blocks in the frame are classified in terms of low activity or high activity in view of the M-ary pyramid.
  • The "classification block size" is an 8 x 8 block having 64 M-ary pixel values represented by 128 bits.
  • An "activity threshold" of 25 is set, where the 8 x 8 block is classified as a high activity block if 25 or more pixel values are nonzero. Otherwise, the 8 x 8 block is classified as a low activity block.
  • Additional higher block classification can be performed, e.g., classifying a macroblock as either a high activity or low activity macroblock.
  • For example, a macroblock comprising at least one subblock that is classified as high activity is itself classified as high activity as well.
  • classification block size and the “activity threshold” can be adjusted according to a particular application and are not limited to those values selected in the preferred embodiment.
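The classification rule of step 220 can be sketched directly from the values given above (an 8 x 8 classification block, activity threshold 25); both values are parameters, as the preceding paragraph notes.

```python
# Sketch of block classification: an 8 x 8 block of M-ary pixel values is
# "high activity" when at least 25 of its 64 values are nonzero (i.e., it
# contains enough edge pixels), and "low activity" otherwise.

def classify_block(block, activity_threshold=25):
    """Classify one block of M-ary values as 'high' or 'low' activity."""
    nonzero = sum(1 for row in block for v in row if v != 0)
    return "high" if nonzero >= activity_threshold else "low"

edge_block = [[1] * 8 for _ in range(4)] + [[0] * 8 for _ in range(4)]  # 32 nonzero
flat_block = [[0] * 8 for _ in range(8)]                                # 0 nonzero
print(classify_block(edge_block), classify_block(flat_block))   # high low
```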
  • In step 230, the block classifications are used to enhance the motion estimation process.
  • Motion estimates in areas with significant image features are more reliable than motion estimates in relatively "flat areas" with little change, due to the aperture problem (e.g., uniform areas where the content of the image is very similar for adjacent blocks). Therefore, the classification method described above is used to increase the reliability of motion estimates in general. However, it should be understood that it is not necessary to preclassify a block as to its content prior to the use of the M-ary pyramid in a motion estimation application.
  • an M-ary pyramid can be employed directly (as illustrated by a dashed line in FIG. 2) to enhance the performance of various types or different architectures of motion estimation methods.
  • motion estimation is generally performed on a block by block basis in a raster scan order.
  • the computational overhead or cost is generally evenly distributed over all the blocks during the motion estimation process.
  • For example, motion estimation in the "edge" blocks can be performed first using a cost function that depends on Y^l(i, j) and/or X^l(i, j).
  • An example of a cost function could involve a bit-wise XOR operation on the M-ary levels in the pyramid, which can be implemented as a fast method on certain architectures. The cost function is used to determine the "best match".
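One way such an XOR-based cost could be realized is sketched below; packing each row of binary pixels into an integer so that a single XOR compares many pixels at once is an implementation assumption, not a detail taken from the patent.

```python
# Sketch of a bit-wise XOR matching cost on binary-pyramid blocks: because
# each pixel is a single bit, the number of mismatching pixels between two
# blocks can be counted with XOR plus a popcount, which is cheap compared
# with an 8-bit SAD.

def pack_row(bits):
    """Pack a list of 0/1 pixels into one integer."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word

def xor_cost(block_a, block_b):
    """Number of mismatching binary pixels between two equal-size blocks."""
    cost = 0
    for ra, rb in zip(block_a, block_b):
        cost += bin(pack_row(ra) ^ pack_row(rb)).count("1")
    return cost

a = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
b = [[1, 0, 0, 0],
     [0, 0, 1, 1]]
print(xor_cost(a, b))   # 1 mismatching pixel
```

The candidate block minimizing this cost is taken as the "best match", exactly as with SAD, but over 1-bit pixels.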
  • FIG. 5 illustrates an input frame 500 which has been divided and classified into a plurality of blocks 510.
  • two blocks 510a have been classified as high activity blocks.
  • motion estimation is performed on these two blocks first.
  • The computational cost can be increased for these two blocks, since these high activity blocks (high-confidence "edge" blocks) will most likely provide very high accuracy motion vectors.
  • more intensive motion estimations are performed in these two blocks than other blocks in the image frame 500, e.g., the high activity blocks can be split to obtain more accurate motion vectors, "half pel" motion estimation can be performed in these two blocks or finer search strategies may be employed.
  • The motion estimation will then propagate to the low activity blocks ("low-confidence" blocks) in the image.
  • this propagation is done intelligently depending on the region or object segmentation that is obtained from the classification.
  • This propagation is performed by using the motion of the edge blocks as an initialization for the motion of adjacent blocks, and using a relatively small search-range to refine this initialization. Namely, the motion estimation process propagates (e.g., in a spiraling order) to blocks 510b, where the initial search area is derived from the motion vectors of the high activity blocks.
  • this propagation strategy is then extended to "flat" blocks, e.g., blocks 510c and so on, that do not lie adjacent to an "edge" block, and has the advantage of fast computation since the refinement search-range is relatively small.
  • the motion estimates will be smoother and easier to encode, which is a major advantage in very low bit rate (VLBR) applications where motion information forms a significant portion of the bit-stream.
  • the classification method also produces computational savings when half-pel refinements are used to increase accuracy of motion estimation.
  • the half-pel refinements are performed only on the "edge" blocks, and not on the relatively flat areas of the image.
  • An alternative embodiment of the present invention involves the use of a plurality of M-ary pyramid architectures or structures (illustrated in FIG. 12) to effect scalable hierarchical motion estimation.
  • a 4-level binary pyramid is constructed as follows:
  • X^l(i, j) represents the gray level at the position (i, j) of the l-th level and X^0(i, j) denotes the original image.
  • Equations (6) and (7) generate a modified binary pyramid, where the highest level of the M-ary pyramid is represented by equation (7).
  • This particular M-ary pyramid architecture 1210 is illustrated in FIG. 12.
  • a plurality of M-ary pyramid architectures 1210, 1220, 1230 and 1240 are generated to provide a scalable hierarchical motion estimation method.
  • FIG. 12 illustrates four M-ary pyramid architectures of varying complexities.
  • M-ary pyramid architecture 1210 comprises three (3) one-bit levels (O) 1210a-1210c and one (1) eight-bit level (E) 1210d.
  • M-ary pyramid architecture 1220 comprises two (2) one-bit levels (O) 1220a-1220b and two (2) eight-bit levels (E) 1220c-1220d.
  • M-ary pyramid architecture 1230 comprises one (1) one-bit level (O) 1230a and three (3) eight-bit levels (E) 1230b-1230d.
  • M-ary pyramid architecture 1240 comprises four (4) eight-bit levels (E) 1240a-1240d. It should be noted that M-ary pyramid architecture 1240 is simply a mean pyramid.
  • motion vectors for level 3 (1210d-1240d) are estimated using full search with "tiling block" sizes of 8 x 8 (710) and 4 x 4 (720), i.e., multi-scale tiling as illustrated in FIG. 7.
  • Multi-scale (or N-scale) tiling is the process of performing motion estimation for a current block of the frame using different "tiling block" sizes. For example, if N is set to three, then three (3) motion vectors are generated for each block within each frame, i.e., the block is "tiled" with three different block sizes or scales. In turn, the motion vectors for level 3 are propagated to level 2 and refined with block sizes of 8 x 8 and 4 x 4.
  • the motion vectors from level 2 are propagated to level 1 and refined with block size of 8 x 8.
  • the motion vectors from level 1 are propagated to level 0 and refined with block size of 16 x 16.
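The coarse-to-fine propagation described above can be sketched as follows. The two helpers are illustrative, not the patent's implementation: `propagate` doubles a vector when moving from a coarser level to the next finer one (each level halves the resolution), and `refine` enumerates candidates in a small window around the propagated vector, from which the block-matching cost function would pick the best:

```python
def propagate(mv, scale=2):
    """Scale a motion vector from a coarser pyramid level to the next finer level."""
    return (mv[0] * scale, mv[1] * scale)

def refine(mv, search_range):
    """Enumerate candidate vectors in a small window around a propagated vector.
    The best candidate would be chosen by the block-matching cost function."""
    candidates = []
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            candidates.append((mv[0] + dx, mv[1] + dy))
    return candidates
```

A vector found at level 3 is thus propagated and refined three times before it becomes a full-resolution vector at level 0.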
  • the present invention is not limited to a particular number of blocks and block sizes. In fact, any number of blocks and/or block sizes can be implemented with the present invention.
  • N-scale tiling can be implemented in combination with the present invention as disclosed in an accompanying patent application filed simultaneously herewith on 29 June 1998 with the title "Apparatus And Method For Employing M-ary Pyramids With N-Scale Tiling" (attorney docket SAR 12455; serial number 09/106,707), hereby incorporated by reference.
  • Scalable hierarchical motion estimation is achieved by changing an O level into an E level during the hierarchical motion estimation process.
  • the necessary levels for the other M-ary pyramid architectures 1220-1240 are available.
  • the E level 1220c is the level 2 of the mean pyramid that was previously generated to compute the binary level 1210c, and the E level 1230b is the level 1 of the mean pyramid that was previously generated to compute the binary level 1210b, and so on.
  • the entire mean pyramid that was generated to derive the M-ary pyramid 1210 is stored in a location, e.g., in the memory of a computer system, for later use.
  • four (4) hierarchical motion vector estimation architectures are obtained, namely HME3B, HME2B, HME1B, and HME0B, to provide a scalable hierarchical motion estimation process.
  • the labels HME3B, HME2B, HME1B, and HME0B refer to hierarchical motion estimation with 3 O layers, 2 O layers, 1 O layer, and 0 O layers, respectively, as shown in FIG. 12.
  • if the query is answered negatively, then method 1100 proceeds to step 1125, where an M-ary pyramid is generated in accordance with the selected M-ary pyramid architecture. If the query is answered positively, then method 1100 proceeds to step 1120, where a new M-ary pyramid architecture is selected for the current frame (e.g., changing from architecture 1210 to 1220), and then proceeds to step 1125.
  • in step 1130, method 1100 performs hierarchical motion estimation starting from the highest level of the M-ary pyramid. Once motion vectors are generated for the highest level, the motion vectors are passed to a lower level of the M-ary pyramid as discussed above.
  • in step 1135, method 1100 queries whether the current M-ary pyramid architecture should be changed for a next level of the M-ary pyramid architecture. Namely, method 1100 can switch to a different M-ary pyramid architecture during the hierarchical motion estimation process. Again, the decision to change a particular M-ary pyramid architecture level can be based on different criteria such as computational complexity, the available memory resources, user's choice and/or the available communication bandwidth. If the query is answered negatively, then method 1100 proceeds to step 1145.
  • in step 1140, a new M-ary pyramid architecture (or simply a new level, e.g., substituting an O level with an E level) is selected for the current frame, and method 1100 then proceeds to step 1145.
  • in step 1145, method 1100 queries whether there is a next level for the current M-ary pyramid architecture. If the query is answered negatively, then method 1100 proceeds to step 1150. If the query is answered positively, then method 1100 returns to step 1130, where hierarchical motion estimation is performed on the next level of the M-ary pyramid architecture. In step 1150, method 1100 queries whether there is a next frame in the image sequence. If the query is answered negatively, then method 1100 ends in step 1155. If the query is answered positively, then method 1100 returns to step 1115, where hierarchical motion estimation is performed for the next frame in the image sequence.
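The control flow of method 1100 can be sketched as a skeleton. This is an illustrative sketch, not the patent's implementation: `pick_architecture` and `estimate_level` are hypothetical callbacks standing in for the architecture-selection steps (1115/1120 and 1135/1140) and the per-level estimation step (1130), and `build_pyramid` is a trivial stand-in for step 1125:

```python
def build_pyramid(frame, arch):
    # Stand-in for step 1125: one entry per level; a real implementation would
    # downsample and quantize each level to the architecture's bit depth.
    return [frame for _ in arch]

def scalable_hme(frames, pick_architecture, estimate_level):
    """Skeleton of method 1100: per frame, choose an architecture, build the
    pyramid, then estimate motion coarse-to-fine, re-querying the architecture
    between levels so an O level can be swapped for an E level mid-estimation."""
    results = []
    for frame in frames:                               # steps 1150/1115
        arch = pick_architecture(frame)                # steps 1115/1120
        pyramid = build_pyramid(frame, arch)           # step 1125
        mvs = None
        for level in reversed(range(len(pyramid))):    # highest level first
            arch = pick_architecture(frame)            # steps 1135/1140 (may switch)
            mvs = estimate_level(pyramid[level], mvs)  # step 1130
        results.append(mvs)
    return results
```

The inner loop makes the scalability point concrete: the architecture query happens once per level, not once per frame.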
  • the present hierarchical motion estimation using a binary pyramid with four levels and four different binary pyramid architectures provides a scalable motion estimation method.
  • the width and the height of the video image are W and H, respectively.
  • the frame rate of the video sequence is F r .
  • the size of the image block is N x N.
  • a picture frame contains (W/N) x (H/N) image blocks.
  • the search range of the search window is ±M pixels.
  • search areas of adjacent blocks may overlap.
  • This overlapped area data can be stored inside an on-chip buffer to reduce external memory bandwidth.
  • a buffer "D" whose size equals that of the search area, i.e., (N + 2M) x (N + 2M) bytes.
  • the new loading data size for buffer D is Nx(N+2M) bytes when the next block is on the same picture slice.
  • the complete buffer is loaded at the beginning of a slice when processing one picture slice.
  • the total external memory bandwidth per slice is approximately ((N+2M)² + ((W/N)-1) x N x (2M+N)) bytes if boundary block cases are neglected.
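The per-slice traffic formula above can be computed directly. The function below is a sketch of that expression (the function name is illustrative): one full search-area load of (N+2M)² bytes at the start of the slice, plus N x (N+2M) new bytes for each of the remaining (W/N − 1) blocks on the slice:

```python
def slice_bandwidth_bytes(W, N, M):
    """External memory traffic per picture slice, boundary blocks neglected:
    (N+2M)^2 bytes for the first block's search area, then N*(N+2M) bytes
    of new data for each of the remaining (W/N - 1) blocks."""
    return (N + 2 * M) ** 2 + (W // N - 1) * N * (2 * M + N)
```

For example, with W = 720, N = 16 and M = 16 this gives 36096 bytes per slice; multiplying by the number of slices per frame and the frame rate yields the Mbytes/s figures of the kind listed in Table 1.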
  • a derivation of the memory bandwidth requirement for HME0B for a 720 x 480 image is given below.
  • the search range at level 3 is set to ±16 pixels.
  • the search range at level 1 is set to ±3 pixels.
  • at level 3 and at level 2, the memory bandwidth (bytes) is approximated by applying the per-slice formula above with the image dimensions, block size and search range of the corresponding level.
  • the memory bandwidth requirements for the HME1B, the HME2B, and the HME3B can be derived in the same manner as the above derivation for the HME0B.
  • Table 1 lists memory bandwidth requirements in Mbytes/s for the four binary pyramid architectures.
  • the memory bandwidth requirement of the present invention is scalable from 6.341 Mbytes/s to 29.208 Mbytes/s as the O layer changes into an E layer.
  • the present invention employs a binary pyramid with four levels, the present invention is not so limited. In fact, other M-ary pyramids can be implemented with different levels. Furthermore, the above block classification (high activity and low activity) method can be used to select the levels or the types of the M-ary pyramid architecture to be used. For example, the type of the M-ary pyramid architecture to be selected can be based on the "activities" (high or low) in a particular frame. In fact, any other block classification methods can be used in conjunction with the present invention.
  • FIG. 8 depicts a wavelet-based encoder 800 that incorporates the present invention.
  • the encoder contains a block motion compensator (BMC) and motion vector coder 804, subtractor 802, discrete wavelet transform (DWT) coder 806, bit rate controller 810, DWT decoder 812 and output buffer 814.
  • the input signal is a video image (a two-dimensional array of pixels (pels) defining a frame in a video sequence).
  • the spatial and temporal redundancy in the video frame sequence must be substantially reduced. This is generally accomplished by coding and transmitting only the differences between successive frames.
  • the encoder has three functions: first, it produces, using the BMC and its coder 804, a plurality of motion vectors that represent motion that occurs between frames; second, it predicts the present frame using a reconstructed version of the previous frame combined with the motion vectors; and third, the predicted frame is subtracted from the present frame to produce a frame of residuals that are coded and transmitted along with the motion vectors to a receiver.
  • the discrete wavelet transform performs a wavelet hierarchical subband decomposition to produce a conventional wavelet tree representation of the input image.
  • the image is decomposed using times two subsampling into high horizontal-high vertical (HH), high horizontal-low vertical (HL), low horizontal-high vertical (LH), and low horizontal-low vertical (LL) frequency subbands.
  • the LL subband is then further subsampled times two to produce a set of HH, HL, LH and LL subbands.
  • This subsampling is accomplished recursively to produce an array of subbands such as that illustrated in FIG. 9 where three subsamplings have been used. Preferably six subsamplings are used in practice.
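The recursion above determines the subband dimensions at each stage: every subsampling halves both dimensions of the LL band. A small illustrative helper (not part of the encoder; integer division assumed) makes this concrete:

```python
def subband_sizes(width, height, levels):
    """Dimensions of the HH/HL/LH/LL subbands produced at each stage of
    repeatedly subsampling the LL band by two in both directions."""
    sizes = []
    for _ in range(levels):
        width, height = width // 2, height // 2
        sizes.append((width, height))
    return sizes
```

For a 720 x 480 image with three subsamplings, the stages yield 360 x 240, 180 x 120 and 90 x 60 subbands; with the six subsamplings preferred in practice, the coarsest LL band would be 11 x 7.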
  • the parent-child dependencies between subbands are illustrated as arrows pointing from the subband of the parent nodes to the subbands of the child nodes.
  • the lowest frequency subband is the top left LL, and the highest frequency subband is at the bottom right HH3. In this example, all child nodes have one parent.
  • a detailed discussion of subband decomposition is presented in J.M. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients," IEEE Trans. on Signal Processing, Vol. 41, No. 12, pp. 3445-3462, December 1993.
  • the DWT coder of FIG. 8 codes the coefficients of the wavelet tree in either a "breadth first" or "depth first" pattern.
  • a breadth first pattern traverses the wavelet tree in a bit-plane by bit-plane pattern, i.e., quantizing all parent nodes, then all children, then all grandchildren and so on.
  • a depth first pattern traverses each tree from the root in the low-low subband (LL3) down through the children (top down), or from the children up to the root in the low-low subband (bottom up).
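The two traversal orders can be sketched over a toy wavelet tree. The dict-based node representation (`name`/`children` keys) is an assumption for illustration only; the point is the visiting order, not the coefficient coding itself:

```python
from collections import deque

def breadth_first(root):
    """Visit all parents before any of their children (bit-plane style)."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node["name"])
        queue.extend(node.get("children", []))
    return order

def depth_first(root):
    """Follow each branch from the root all the way down before moving on
    to the next sibling (top-down depth-first order)."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node["name"])
        stack.extend(reversed(node.get("children", [])))
    return order
```

On a tree rooted at LL with children A (having child A1) and B, breadth first visits LL, A, B, A1, while depth first visits LL, A, A1, B.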
  • FIG. 6 illustrates an encoding system 600 of the present invention.
  • the encoding system comprises a general purpose computer 610 and various input/output devices 620.
  • the general purpose computer comprises a central processing unit (CPU) 612, a memory 614 and an encoder 616 for receiving and encoding a sequence of images.
  • the encoder 616 can simply be the encoder 100 or the encoder 800 as discussed above.
  • the encoder 616 can be a physical device which is coupled to the CPU 612 through a communication channel.
  • the encoder 616 can be represented by a software application (or a combination of software and hardware, e.g., application specific integrated circuits (ASIC)), which is loaded from a storage device, e.g., a magnetic or optical disk, and resides in the memory 614 of the computer.
  • the encoder 100 of the present invention can be stored on a computer readable medium.
  • the computer 610 can be coupled to a plurality of input and output devices 620, such as a keyboard, a mouse, a camera, a camcorder, a video monitor, any number of imaging devices or storage devices, including but not limited to, a tape drive, a floppy drive, a hard disk drive or a compact disk drive.
  • the input devices serve to provide inputs to the computer for producing the encoded video bitstreams or to receive the sequence of video images from a storage device or an imaging device.
  • a communication channel 630 is shown where the encoded signal from the encoding system is forwarded to a decoding system (not shown).

Abstract

An apparatus and a concomitant method for performing hierarchical block-based motion estimation with a high degree of scalability is disclosed. The present invention decomposes each of the image frames within an image sequence into an M-ary pyramid. Different dynamic ranges for representing the pixel values are used for different levels of the M-ary pyramid, thereby generating a plurality of different "P-bit" levels, i.e., a plurality of different M-ary pyramid architectures. The present scalable hierarchical motion estimation provides the flexibility of switching from one M-ary pyramid architecture to another M-ary pyramid architecture according to the available platform resources and/or user's choice.

Description

APPARATUS AND METHOD FOR PERFORMING SCALABLE HIERARCHICAL MOTION ESTIMATION
This is a continuation-in-part of Application No. 09/002,258, filed on December 31, 1997.
The invention relates generally to a system for encoding image sequences and, more particularly, to an apparatus and a concomitant method for performing hierarchical block-based motion estimation with a high degree of scalability.
BACKGROUND OF THE INVENTION
An image sequence, such as a video image sequence, typically includes a sequence of image frames or pictures. The reproduction of video containing moving objects typically requires a frame speed of thirty image frames per second, with each frame possibly containing in excess of a megabyte of information. Consequently, transmitting or storing such image sequences requires a large amount of either transmission bandwidth or storage capacity. To reduce the necessary transmission bandwidth or storage capacity, the frame sequence is compressed such that redundant information within the sequence is not stored or transmitted. Television, video conferencing and CD-ROM archiving are examples of applications which can benefit from efficient video sequence encoding.
Generally, to encode an image sequence, information concerning the motion of objects in a scene from one frame to the next plays an important role in the encoding process. Because of the high redundancy that exists between consecutive frames within most image sequences, substantial data compression can be achieved using a technique known as motion estimation/compensation. In brief, the encoder only encodes the differences relative to areas that are shifted with respect to the areas coded. Namely, motion estimation is a process of determining the direction and magnitude of motion (motion vectors) for an area (e.g., a block or macroblock) in the current frame relative to one or more reference frames. Motion compensation, in turn, is a process of using the motion vectors to generate a prediction (predicted image) of the current frame. The difference between the current frame and the predicted frame results in a residual signal (error signal), which contains substantially less information than the current frame itself. Thus, a significant saving in coding bits is realized by encoding and transmitting only the residual signal and the corresponding motion vectors.
However, encoder designers must address the dichotomy of attempting to increase the precision of the motion estimation process to minimize the residual signal (i.e., reducing coding bits) or accepting a lower level of precision in the motion estimation process to minimize the computational overhead. Namely, determining the motion vectors from the frame sequence requires intensive searching between frames to determine the motion information. A more intensive search will generate a more precise set of motion vectors at the expense of more computational cycles.
To illustrate, some systems determine motion information using a so-called block based approach. In a simple block based approach, the current frame is divided into a number of blocks of pixels (referred to hereinafter as the "current blocks"). For each of these current blocks, a search is performed within a selected search area in the preceding frame for a block of pixels that "best" matches the current block. This search is typically accomplished by repetitively comparing a selected current block to similarly sized blocks of pixels in the selected search area of the preceding frame. However, the determination of motion vectors by this exhaustive search approach is computationally intensive, especially where the search area is particularly large.
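The exhaustive block-matching search described above can be sketched directly. This is an illustrative, unoptimized sketch (frames as nested lists, sum-of-absolute-differences as the matching criterion — a common choice, though the text does not fix one here):

```python
def full_search(cur, ref, bx, by, n, search):
    """Exhaustive block matching: compare the n x n current block at (bx, by)
    against every candidate within +/-search pixels in the reference frame,
    returning the displacement with the smallest sum of absolute differences."""
    h, w = len(ref), len(ref[0])
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue  # candidate block falls outside the reference frame
            sad = sum(abs(cur[by + j][bx + i] - ref[y + j][x + i])
                      for j in range(n) for i in range(n))
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad
```

The cost of the nested loops — (2·search+1)² candidates, each costing n² comparisons — is exactly the computational burden that the hierarchical approach described next is designed to reduce.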
Alternatively, other motion estimation methods incorporate the concept of hierarchical motion estimation (HME), where an image is decomposed into a multiresolution framework, i.e., a pyramid. A hierarchical motion vector search is then performed, where the search proceeds from the lowest resolution to the highest resolution of the pyramid. Although HME has been demonstrated to be a fast and effective motion estimation method, the generation of the pyramid still incurs a significant amount of computational cycles. Furthermore, the above motion estimation methods are not easily scalable. Namely, the architectures of these motion estimation methods do not provide a user or an encoder with the flexibility to scale or switch to a different architecture to account for available computational resources and/or user's choices.
Therefore, there is a need in the art for an apparatus and a concomitant method for a hierarchical block-based motion estimation with a high degree of scalability.
SUMMARY OF THE INVENTION
An embodiment of the present invention is an apparatus and method for performing hierarchical block-based motion estimation with a high degree of scalability. The present scalable hierarchical motion estimation architecture provides the flexibility of switching from one-bit/pixel to eight-bit/pixel representation according to the available platform resources and/or user's choice. More specifically, the present invention decomposes each of the image frames within an image sequence into an M-ary pyramid, e.g., a four level binary pyramid. Different dynamic ranges for representing the pixel values are used for different levels of the binary pyramid, thereby generating a plurality of different "P-bit" levels.
For example, eight bits are used to represent each pixel value (eight-bit/pixel (P=8)) at the highest level of the M-ary pyramids, whereas one bit is used to represent each pixel value (one-bit/pixel (P=1)) at all other levels of the M-ary pyramids. Scalable hierarchical motion estimation is achieved by changing the dynamic ranges of the levels of the M-ary pyramid, i.e., implementing different combinations of eight-bit pixel layers with one-bit layers (levels) to produce a plurality of M-ary pyramids of varying complexities. Thus, the scalability of the hierarchical motion estimation can be implemented to be responsive to computational complexity, memory requirement and/or communication bandwidth, thereby providing features such as platform-adaptive encoding and computing.
BRIEF DESCRIPTION OF THE DRAWINGS
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates a block diagram of the encoder of the present invention;
FIG. 2 illustrates a flowchart of a method for reducing the computational complexity in determining motion vectors for block-based motion estimation; FIG. 3 illustrates a block diagram of a general mean pyramid; FIG. 4 illustrates a block diagram of the quantization process that generates an M-ary pyramid;
FIG. 5 illustrates an input frame which has been divided and classified into a plurality of blocks;
FIG. 6 illustrates an encoding system of the present invention; FIG. 7 illustrates a block diagram of a block of pixels with multi-scale tiling;
FIG. 8 illustrates a block diagram of a second embodiment of the apparatus of the present invention;
FIG. 9 illustrates a graphical representation of a wavelet tree; FIG. 10 illustrates a flowchart of a method for generating an M-ary pyramid for an image;
FIG. 11 illustrates a flowchart of a method for performing scalable motion estimation on an M-ary pyramid; and
FIG. 12 illustrates a block diagram of a plurality of different M-ary pyramid architectures.
DETAILED DESCRIPTION
FIG. 1 depicts a block diagram of the apparatus 100 of the present invention for reducing the computational complexity in determining motion vectors for block-based motion estimation. The preferred embodiment of the present invention is described below using an encoder, but it should be understood that the present invention can be employed in image processing systems in general. Furthermore, the present invention can be employed in encoders that are compliant with various coding standards. These standards include, but are not limited to, the Moving Picture Experts Group standards (e.g., MPEG-1 (11172-*) and MPEG-2 (13818-*)), H.261 and H.263.
The apparatus 100 is an encoder or a portion of a more complex block-based motion compensated coding system. The apparatus 100 comprises a motion estimation module 140, a motion compensation module 150, an optional segmentation module 151, a preprocessing module 120, a rate control module 130, a transform module, (e.g., a DCT module) 160, a quantization module 170, a coder, (e.g., a variable length coding module) 180, a buffer 190, an inverse quantization module 175, an inverse transform module (e.g., an inverse DCT module) 165, a subtractor 115 and a summer 155. Although the encoder 100 comprises a plurality of modules, those skilled in the art will realize that the functions performed by the various modules are not required to be isolated into separate modules as shown in FIG. 1. For example, the set of modules comprising the motion compensation module 150, inverse quantization module 175 and inverse DCT module 165 is generally known as an "embedded decoder". FIG. 1 illustrates an input image (image sequence) on path 110 which is digitized and represented as a luminance and two color difference signals (Y, Cr, Cb) in accordance with the MPEG standards. These signals are further divided into a plurality of layers such that each picture (frame) is represented by a plurality of macroblocks. Each macroblock comprises four (4) luminance blocks, one Cr block and one Cb block where a block is defined as an eight (8) by eight (8) sample array. The division of a picture into block units improves the ability to discern changes between two successive pictures and improves image compression through the elimination of low amplitude transformed coefficients (discussed below).
The following disclosure uses the MPEG standard terminology; however, it should be understood that the term macroblock or block in the present invention is intended to describe a block of pixels of any size or shape that is used for the basis of encoding. Broadly speaking, a "macroblock" could be as small as a single pixel, or as large as an entire video frame. In the preferred embodiment, the digitized input image signal undergoes one or more preprocessing steps in the preprocessing module 120. More specifically, preprocessing module 120 comprises an M-ary pyramid generator 122 and a block classifier 124. The M-ary pyramid generator 122 employs a mean filter 123a and a quantizer 123b to filter and to quantize each frame into a plurality of different resolutions, i.e., an M-ary pyramid of resolutions, where the different resolutions of each frame are correlated in a hierarchical fashion as described below. In turn, using the pyramid of resolutions, the block classifier 124 is able to quickly classify areas (blocks) as areas of high activity or low activity. A detailed description is provided below for the functions performed by the preprocessing module 120.
The input image on path 110 is also received into motion estimation module 140 for estimating motion vectors. A motion vector is a two-dimensional vector which is used by motion compensation to provide an offset from the coordinate position of a block in the current picture to the coordinates in a reference frame. The use of motion vectors greatly enhances image compression by reducing the amount of information that is transmitted on a channel because only the changes within the current frame are coded and transmitted. In the preferred embodiment, the motion estimation module 140 also receives information from the preprocessing module 120 to enhance the performance of the motion estimation process.
The motion vectors from the motion estimation module 140 are received by the motion compensation module 150 for improving the efficiency of the prediction of sample values. Motion compensation involves a prediction that uses motion vectors to provide offsets into the past and/or future reference frames containing previously decoded sample values, and is used to form the prediction error. Namely, the motion compensation module 150 uses the previously decoded frame and the motion vectors to construct an estimate (motion compensated prediction or predicted image) of the current frame on path 152. This motion compensated prediction is subtracted via subtractor 115 from the input image on path 110 in the current macroblocks to form an error signal (e) or predictive residual on path 153. The predictive residual signal is passed to a transform module, e.g., a DCT module 160. The DCT module then applies a forward discrete cosine transform process to each block of the predictive residual signal to produce a set of eight (8) by eight (8) block of DCT coefficients. The discrete cosine transform is an invertible, discrete orthogonal transformation where the DCT coefficients represent the amplitudes of a set of cosine basis functions.
The resulting 8 x 8 block of DCT coefficients is received by quantization (Q) module 170, where the DCT coefficients are quantized. The process of quantization reduces the accuracy with which the DCT coefficients are represented by dividing the DCT coefficients by a set of quantization values or scales with appropriate rounding to form integer values. The quantization values can be set individually for each DCT coefficient, using criteria based on the visibility of the basis functions (known as visually weighted quantization). By quantizing the DCT coefficients with this value, many of the DCT coefficients are converted to zeros, thereby improving image compression efficiency.
Next, the resulting 8 x 8 block of quantized DCT coefficients is received by a coder, e.g., variable length coding module 180 via signal connection 171, where the two-dimensional block of quantized coefficients is scanned in a "zig-zag" order to convert it into a one-dimensional string of quantized DCT coefficients. Variable length coding (VLC) module 180 then encodes the string of quantized DCT coefficients and all side-information for the macroblock such as macroblock type and motion vectors. Thus, the VLC module 180 performs the final step of converting the input image into a valid data stream.
The data stream is received into a buffer, e.g., a "First In-First Out" (FIFO) buffer 190. A consequence of using different picture types and variable length coding is that the overall bit rate is variable. Namely, the number of bits used to code each frame can be different. Thus, in applications that involve a fixed-rate channel, a FIFO buffer is used to match the encoder output to the channel for smoothing the bit rate. Thus, the output signal on path 195 from FIFO buffer 190 is a compressed representation of the input image 110, where it is sent to a storage medium or a telecommunication channel. The rate control module 130 serves to monitor and adjust the bit rate of the data stream entering the FIFO buffer 190 to prevent overflow and underflow on the decoder side (within a receiver or target storage device, not shown) after transmission of the data stream. A fixed-rate channel is assumed to carry bits at a constant rate to an input buffer within the decoder (not shown). At regular intervals determined by the picture rate, the decoder instantaneously removes all the bits for the next picture from its input buffer. If there are too few bits in the input buffer, i.e., all the bits for the next picture have not been received, then the input buffer underflows resulting in an error. Similarly, if there are too many bits in the input buffer, i.e., the capacity of the input buffer is exceeded between picture starts, then the input buffer overflows resulting in an overflow error. Thus, it is the task of the rate control module 130 to monitor the status of buffer 190 to control the number of bits generated by the encoder, thereby preventing the overflow and underflow conditions. A rate control method may control the number of coding bits by adjusting the quantization scales.
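The underflow/overflow reasoning above can be captured by a simple occupancy model of the FIFO. This is an illustrative sketch only (the function name and the (occupancy, status) return shape are assumptions, not the patent's rate control method): each picture interval, some bits arrive from the coder and some are drained toward the channel, and the rate controller must keep the occupancy inside the buffer's bounds:

```python
def buffer_status(occupancy, bits_in, bits_out, capacity):
    """Advance the FIFO occupancy by one picture interval: `bits_in` arrive
    from the coder, `bits_out` are drained by the fixed-rate channel. Returns
    the clamped occupancy and which failure mode, if any, was reached."""
    occupancy += bits_in - bits_out
    if occupancy < 0:
        return 0, "underflow"      # too few bits produced for the channel
    if occupancy > capacity:
        return capacity, "overflow"  # coder produced more than the buffer holds
    return occupancy, "ok"
```

A rate controller reacting to this model would raise the quantization scale as occupancy approaches capacity (fewer bits per picture) and lower it as occupancy nears empty.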
Furthermore, the resulting 8 x 8 block of quantized DCT coefficients from the quantization module 170 is received by the inverse quantization module 175 and inverse DCT module 165 via signal connection 172. In brief, at this stage, the encoder regenerates I-frames and P-frames of the image sequence by decoding the data so that they are used as reference frames for subsequent encoding.
FIG. 2 illustrates a flowchart of a method 200 for reducing the computational complexity in determining motion vectors for block-based motion estimation. Namely, the method 200 enhances a block-based motion estimation method by quickly defining an initial search area where a match will likely occur.
More specifically, method 200 starts in step 205 and proceeds to step 210 where an M-ary pyramid (or M-ary mean pyramid) is generated for each image frame in the image sequence. A detailed description on the method of generating an M-ary pyramid is provided below with reference to FIGs. 3, 4, and 10.
More specifically, FIG. 10 illustrates a flowchart of a method 1000 for generating an M-ary pyramid for an image. The method starts in step 1005 and proceeds to step 1010 where the original image is decomposed into a mean pyramid of images as illustrated in FIG. 3.
FIG. 3 illustrates a block diagram of a general mean pyramid 300, where the mean pyramid comprises a plurality of levels 310, 320 and 330. The lowest level 310 is an original image frame from the image sequence having a plurality of pixels 311 represented by "x"s. Typically, these pixels are represented by pixel values having a dynamic range that is limited by the number of bits allocated to represent the pixel values. For example, if eight (8) bits are allocated, then a pixel value may take a value from one of 256 possible values. In a mean pyramid, a next higher level is generated by lowpass filtering and downsampling by a factor of two in both directions, thereby generating a single pixel value (parent) for a higher level from four (4) pixel values (children) in a lower level. This is illustrated in FIG. 3, where each set of four pixels 312a- d is used to generate a single pixel value 321 in level 320. In turn, the set of four pixel values 322a is used to generate a single pixel value 331 in level 330 and so on. It should be understood that the present invention is not limited to a mean pyramid having three levels. The number of levels is generally limited by the size of the image and the downsampling factor selected to generate the next lower resolution image. Thus, the number of levels in the mean pyramid can be selected for a particular application.
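The parent-from-four-children construction above can be sketched as follows. This is a minimal illustration of the mean pyramid (integer division stands in for the averaging; rounding conventions and the lowpass filter choice are simplifications), assuming image dimensions divisible by two at each level:

```python
def mean_pyramid(image, levels):
    """Build a mean pyramid: each parent pixel is the average of its four
    children, halving the resolution in both directions at every level."""
    pyramid = [image]  # level 0 is the original frame
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = len(prev) // 2, len(prev[0]) // 2
        pyramid.append([[(prev[2*y][2*x] + prev[2*y][2*x+1] +
                          prev[2*y+1][2*x] + prev[2*y+1][2*x+1]) // 4
                         for x in range(w)] for y in range(h)])
    return pyramid
```

Each call to the inner comprehension plays the role of generating pixel 321 of level 320 from the four children 312a-d of level 310.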
In a mean pyramid, the parent pixel value is derived by taking the average of its four children pixel values, thus the term mean pyramid. However, other measures or metrics can be used to generate other types of pyramids, e.g., the measure can be based on the median of the four children pixel values. Alternatively, a larger area around the children pixels can be used for a weighted average to obtain a general lowpass pyramid. For the purpose of this invention, each of these different types of pyramids having a set of pyramidal images (e.g., a mean pyramid, a median pyramid, a lowpass pyramid and the like) can be broadly classified as an "image pyramid". From this image pyramid, the M-ary pyramid is generated.
Returning to FIG. 10, method 1000 then generates an M-ary pyramid from said mean pyramid in step 1020. Namely, in an "M-ary pyramid", the pixel values are quantized such that each quantized pixel value can only take "M" possible pixel values as illustrated in FIG. 4 below. For example, if M equals two (2), then each quantized pixel value can take on a value of 0 or 1, i.e., resulting in a "binary pyramid". Thus, different dynamic ranges for representing the pixel values are used for different levels of the binary pyramid, thereby generating a plurality of different "P-bit" levels or layers. Furthermore, a level of the M-ary pyramid having eight bits/pixel is referred to as an 8-bit level (illustrated as an "E" level in FIG. 12), whereas a level of the M-ary pyramid having one bit/pixel (e.g., Boolean) is referred to as a 1-bit level (illustrated as an "O" level in FIG. 12). Thus, the mean pyramid 300 as discussed above comprises a plurality of "E" levels. The distinction and combinatorial use of these "E" and "O" levels are discussed further below.
FIG. 4 illustrates a block diagram of the quantization process that generates a ternary pyramid, where M equals three (3). More specifically, an eight-bit pixel value 255 (410a) is quantized into a two-bit pixel value 10 (420a) based on the difference between the child and parent pixels. Namely, a difference is computed between a parent 430a and each of its children 410a-d, where each of the four (4) differences is then quantized into three possible values 10, 00, and 01. Thus, pixel value 128 (410b and 410c) is quantized into a pixel value 00 (420b and 420c) and pixel value 0 (410d) is quantized into a pixel value 01 (420d). These representation levels are suitable for the bit-wise XOR based cost function that will be used for motion estimation. They are also useful for feature detection and block classification. The M-ary pyramid reduces the accuracy of the pixel values, thereby allowing rapid detection of "features" within an image. Features are defined as areas of high activity or intensity, e.g., the edges of an object. It should be noted that the levels 410 and 430 are levels of a mean pyramid, while level 420 is a level of an M-ary pyramid (where M=3). Both of these pyramids may have additional levels as illustrated in FIG. 4, but the M-ary pyramid will have one level less than the mean pyramid. Namely, one needs two mean pyramid levels 410 and 430 to generate a single M-ary pyramid level 420. Furthermore, the significant reduction in the number of bits used to represent the pixel values translates into a reduction in computational overhead in the motion estimation process. For example, the block matching operation performed in the motion estimation process can be accelerated since there are fewer possible values that a pixel value can take on, thereby simplifying the overall block matching process.
Although M can be any value, it has been found that a "lower order" M-ary pyramid, e.g., a binary pyramid decomposition, can be more sensitive to noise than a "higher order" M-ary pyramid, e.g., a ternary pyramid. Namely, since the quantized pixel values in a binary pyramid can only take one of two possible values, noise may introduce errors, where a pixel value can be erroneously interpreted as having a value 1 instead of 0 or vice versa. However, a "higher order" M-ary pyramid requires more computational overhead. Thus, although it has been observed that an M-ary pyramid decomposition is best employed when M is greater than 2, the selection of a particular M-ary pyramid decomposition is often dictated by the requirements of the particular application. Once the M-ary pyramid is generated, method 1000 ends in step 1030 and returns to step 220 of FIG. 2.
It should be understood that the important aspect in step 210 is the generation of an M-ary pyramid for each of the input images in the image sequence. As such, although the preferred embodiment generates an M-ary mean pyramid, other types of M-ary pyramids can be employed in the present invention, e.g., an M-ary median pyramid, an M-ary lowpass pyramid and so on. Alternatively, the inventive concept of an M-ary mean pyramid decomposition can be expressed in equation form. Let (i, j) represent the pixel locations on an image frame and let I(i, j) represent the intensity at location (i, j). Further, let l indicate the level within a pyramid, with 0 ≤ l ≤ L, where L is the highest level in the pyramid. Then, the mean pyramids X^l(i, j), 1 ≤ l ≤ L, are constructed as follows:
X^l(i, j) = (1/4) Σ_{m=0}^{1} Σ_{n=0}^{1} X^{l−1}(2i + m, 2j + n)   (1)

where X^0(i, j) = I(i, j).
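As an illustrative sketch only (function and variable names are not from the patent), the mean-pyramid construction of equation (1) can be written in a few lines of NumPy:

```python
import numpy as np

def mean_pyramid(image, num_levels):
    """Build a mean pyramid per equation (1): each parent pixel at level l
    is the average of its four children at level l-1 (2x downsampling in
    both directions), with level 0 being the original image."""
    levels = [np.asarray(image, dtype=np.float64)]
    for _ in range(num_levels):
        x = levels[-1]
        h, w = x.shape
        x = x[: h - h % 2, : w - w % 2]  # trim odd edges so 2x2 blocks tile exactly
        parent = (x[0::2, 0::2] + x[0::2, 1::2] +
                  x[1::2, 0::2] + x[1::2, 1::2]) / 4.0
        levels.append(parent)
    return levels
```

Each call to `mean_pyramid` returns the full list of levels, so the lower-resolution images remain available for the M-ary quantization step described below.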
Returning to FIG. 2, from these mean pyramids, information such as features within a block can be extracted in step 220 below. In one embodiment, the block is an 8 × 8 subblock of a macroblock, but it should be understood that the present invention is not limited to this block size. In particular, features like edges can be extracted from the variation of intensities within a block. This variation is represented by calculating the difference between the mean value at a level l, 0 ≤ l ≤ L−1, and the mean value at level l+1. However, in order to obtain a robust feature, and in order to facilitate fast motion estimation, these differences are quantized to produce the M-ary pyramid. Each level of the M-ary pyramid will illustrate a pattern over the image that can be used to identify image features like edges and zero-crossings or for implementing motion estimation. For example, a binary pyramid B^l(i, j) of images can be built as follows:
[Equation (1a), rendered as an image in the original (imgf000014_0001): the quantizer condition that maps the difference between adjacent mean-pyramid levels to the two binary values "0" and "1".]
where l indicates a level within the binary pyramid. Although equation (1a) illustrates a particular condition (quantizer step) that defines the two values ("0" and "1") of the binary pyramid, other conditions or quantizer steps can be used to define the two values ("0" and "1") of the binary pyramid in accordance with a particular application.
Alternatively, a ternary pyramid Y^l(i, j) of images can be built. For example, denoting the pattern value in the M-ary (M=3) pyramid by Y^l(i, j):

Y^l(i, j) = Quant[ X^l(i, j) − X^{l+1}(INT(i/2), INT(j/2)) ],   0 ≤ l ≤ L−1   (2)

Denote the argument of Quant[·] by λ. For example, consider the case of ternary pyramids having a threshold T, and define Y^l(i, j) as follows:

Y^l(i, j) = 00 if |λ| ≤ T;  10 if λ > T;  01 if λ < −T   (3)
This definition has the advantage of noise-robustness if the quantization threshold T (e.g., in the preferred embodiment T is set to 5) is suitably chosen for a particular application. Namely, it is possible to define a "dead zone", e.g., |λ| ≤ T, where slight variations in the pixel values due to noise can be removed effectively. Thus, any M-ary pyramid (M > 2) having a dead zone around zero will minimize the noise sensitivity problem. In relatively flat areas (areas of low activity), Y^l(i, j) will contain a large number of zeros (0), while in regions containing edges, Y^l(i, j) will contain a number of ones (1). Once the input image is decomposed into an M-ary pyramid, the blocks in the input image can be classified for the purpose of feature extraction using the M-ary pyramid, Y^l(i, j). Namely, the M-ary pyramid can be used to rapidly detect features in the input image without incurring a high computational overhead. The detected features can be used to enhance the motion estimation process as discussed below or other image processing steps, e.g., segmentation of areas (such as objects) within an image, e.g., by using segmentation module 151. Segmentation is an important image processing step, where important areas in the image can be identified to receive special treatment. For example, the face of a person during a video conferencing application may demand special image processing such as receiving a greater allocation of coding bits. Additionally, segmentation can be employed to identify large objects where global motion estimation can be performed on these large objects.
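A minimal sketch of the ternary quantizer with a dead zone follows; it is illustrative only, not the patent's implementation. The symbol assignment (positive difference → "10", negative → "01") follows the FIG. 4 description, and the default T = 5 mirrors the preferred embodiment; `ternary_level` and its arguments are assumed names.

```python
import numpy as np

def ternary_level(child, parent, T=5):
    """Quantize lambda = X^l(i,j) - X^(l+1)(floor(i/2), floor(j/2)) into
    three 2-bit symbols with a dead zone |lambda| <= T: 00 inside the dead
    zone, 10 for positive edges, 01 for negative edges (per FIG. 4)."""
    child = np.asarray(child, dtype=np.float64)
    parent = np.asarray(parent, dtype=np.float64)
    # Upsample the parent level so each parent value covers its 2x2 children.
    up = np.repeat(np.repeat(parent, 2, axis=0), 2, axis=1)
    lam = child - up[: child.shape[0], : child.shape[1]]
    y = np.zeros(child.shape, dtype=np.uint8)
    y[lam > T] = 0b10   # bright-side edge
    y[lam < -T] = 0b01  # dark-side edge
    return y
```

Values within ±T of the parent mean are suppressed to 00, which is exactly the dead-zone noise rejection discussed above.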
It should be understood that the preceding discussion uses the ternary pyramid as an example and shows one possible method in which the quantization thresholds or levels can be assigned for feature identification and classification. In general, M-ary pyramids with M > 2 can be used with the specific assignment of the quantization thresholds being dependent on the requirement of a particular application and/or the content of the image sequence.
Returning to FIG. 2, after the M-ary pyramid is generated, method 200 proceeds to step 220 where the blocks in the frame are classified in terms of low activity or high activity in view of the M-ary pyramid. In the preferred embodiment, the "classification block size" is an 8 × 8 block having 64 M-ary pixel values represented by 128 bits. An "activity threshold" of 25 is set, where the 8 × 8 block is classified as a high activity block if 25 or more pixel values are nonzero. Otherwise, the 8 × 8 block is classified as a low activity block. Additional higher block classification can be performed, e.g., classifying a macroblock as either a high activity or low activity macroblock. In the preferred embodiment, a macroblock comprising at least one subblock that is classified as high activity causes the macroblock to be classified as high activity as well. It should be understood that the "classification block size" and the "activity threshold" can be adjusted according to a particular application and are not limited to those values selected in the preferred embodiment.
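The block classification step can be sketched as follows; this is an illustrative rendering of the preferred-embodiment parameters (8 × 8 blocks, threshold 25), with `classify_blocks` an assumed name:

```python
import numpy as np

def classify_blocks(mary_level, block=8, activity_threshold=25):
    """Flag each block x block tile of an M-ary level as high activity
    (True) when at least activity_threshold of its quantized values
    are nonzero; otherwise the tile is low activity (False)."""
    h, w = mary_level.shape
    flags = np.zeros((h // block, w // block), dtype=bool)
    for bi in range(h // block):
        for bj in range(w // block):
            tile = mary_level[bi * block:(bi + 1) * block,
                              bj * block:(bj + 1) * block]
            flags[bi, bj] = np.count_nonzero(tile) >= activity_threshold
    return flags
```

Because the nonzero symbols mark edges, counting them is a cheap proxy for "activity"; the macroblock-level rule described above would then OR the flags of its subblocks.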
Returning to FIG. 2, after block classification, method 200 proceeds to step 230 where the block classifications are used to enhance the motion estimation process. Generally, motion estimates in areas with significant image features are more reliable than motion estimates in relatively "flat areas" with little change, due to the aperture problem (e.g., uniform areas where the content of the image is very similar for adjacent blocks). Therefore, the classification method described above is used to increase the reliability of motion estimates in general. However, it should be understood that it is not necessary to preclassify a block as to its content prior to the use of the M-ary pyramid in a motion estimation application.
Namely, it should be understood that the present invention of an M-ary pyramid can be employed directly (as illustrated by a dashed line in FIG. 2) to enhance the performance of various types or different architectures of motion estimation methods.
More specifically, motion estimation is generally performed on a block by block basis in a raster scan order. The computational overhead or cost is generally evenly distributed over all the blocks during the motion estimation process. In the present invention, motion estimation in the "edge" blocks (high activity blocks) can be performed first using a cost function that depends on Y^l(i, j) and/or X^l(i, j). This approach emphasizes the features in the image and provides robust, reliable motion estimates in the presence of sensor noise, quantization noise, and illumination changes. An example of a cost function could involve a bit-wise XOR operation on the M-ary levels in the pyramid, which can be implemented as a fast method on certain architectures. The cost function is used to determine the "best match". Let us consider an M-ary valued block at time t (current frame), Y^l(i, j, t), and another M-ary valued block at time t−1 (previous frame), Y^l(m, n, t−1). The cost function is then expressed as:
Cost = Σ_{pixels within the block} [ Number of ones in { Y^l(i, j, t) ⊕ Y^l(m, n, t−1) } ]   (4)

where ⊕ represents a bitwise XOR operation. This cost function produces substantial computational savings compared to the standard "absolute difference" cost function used on the original 8-bit pixel intensity values. This procedure is performed hierarchically over the M-ary pyramid.
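A minimal sketch of the XOR cost of equation (4), assuming each pixel is stored as a small integer holding its M-ary symbol (the function name is illustrative):

```python
def mary_xor_cost(block_a, block_b):
    """XOR corresponding M-ary symbols and count the resulting one-bits
    over the block: identical symbols contribute 0, mismatched 2-bit
    symbols contribute 1 or 2 bits to the cost."""
    return sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
```

Because the symbols are only a couple of bits wide, several of them can be packed into a machine word and XORed together, which is where the speedup over 8-bit absolute differences comes from.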
In other words, the motion estimation method is initiated at the high activity blocks. FIG. 5 illustrates an input frame 500 which has been divided and classified into a plurality of blocks 510. In the preferred embodiment, two blocks 510a have been classified as high activity blocks. As such, motion estimation is performed on these two blocks first. In fact, the computational cost can be increased for these two blocks, since these high activity blocks (high-confidence "edge" blocks) will most likely provide very high accuracy motion vectors. Thus, more intensive motion estimations are performed in these two blocks than in other blocks of the image frame 500, e.g., the high activity blocks can be split to obtain more accurate motion vectors, "half pel" motion estimation can be performed in these two blocks, or finer search strategies may be employed.
In turn, after motion estimation is completed for the high activity blocks, the motion estimation will then propagate to the low activity blocks ("low-confidence" blocks) in the image. However, this propagation is done intelligently depending on the region or object segmentation that is obtained from the classification. This propagation is performed by using the motion of the edge blocks as an initialization for the motion of adjacent blocks, and using a relatively small search-range to refine this initialization. Namely, the motion estimation process propagates (e.g., in a spiraling order) to blocks 510b, where the initial search area is derived from the motion vectors of the high activity blocks. In turn, this propagation strategy is then extended to "flat" blocks, e.g., blocks 510c and so on, that do not lie adjacent to an "edge" block, and has the advantage of fast computation since the refinement search-range is relatively small. Furthermore, the motion estimates will be smoother and easier to encode, which is a major advantage in very low bit rate (VLBR) applications where motion information forms a significant portion of the bit-stream. Furthermore, these smoother motion estimates can be expected to perform better in a temporal interpolation application.
Finally, the classification method also produces computational savings when half-pel refinements are used to increase accuracy of motion estimation. The half-pel refinements are performed only on the "edge" blocks, and not on the relatively flat areas of the image.
An alternative embodiment of the present invention involves the use of a plurality of M-ary pyramid architectures or structures (illustrated in FIG. 12) to effect scalable hierarchical motion estimation. For example, in this alternative embodiment, a 4-level binary pyramid is constructed as follows:
X^l(i, j) = (1/4) Σ_{m=0}^{1} Σ_{n=0}^{1} X^{l−1}(2i + m, 2j + n),   1 ≤ l ≤ 3   (5)

where X^l(i, j) represents the gray level at the position (i, j) of the l-th level and X^0(i, j) denotes the original image.
Secondly, the 4-level binary pyramidal images are built as follows:
[Equation (6), rendered as an image in the original (imgf000019_0001): the binary quantizer that maps the difference between mean-pyramid levels l and l+1 to the two values "0" and "1", for 0 ≤ l ≤ 2.]

[Equation (7), rendered as an image in the original (imgf000019_0002): the highest level of the pyramid, defined as the mean-pyramid level X^3(i, j).]
It should be noted that the M-ary pyramid generated by equations (6) and (7) generates a modified binary pyramid having a highest level of the M-ary pyramid represented by equation (7). Namely, the highest level of the M-ary pyramid (e.g., binary pyramid (M=2)) is replaced with the highest level of the mean pyramid. This particular M-ary pyramid architecture 1210 is illustrated in FIG. 12. In the preferred embodiment, a plurality of M-ary pyramid architectures 1210, 1220, 1230 and 1240 are generated to provide a scalable hierarchical motion estimation method.
More specifically, FIG. 12 illustrates four M-ary pyramid architectures of varying complexities. M-ary pyramid architecture 1210 comprises three (3) one- bit levels (O) 1210a- 1210c and one (1) eight-bit level (E) 1210d. M-ary pyramid architecture 1220 comprises two (2) one-bit levels (O) 1220a- 1220b and two (2) eight-bit levels (E) 1220c- 1220d. M-ary pyramid architecture 1230 comprises one (1) one-bit level (O) 1230a and three (3) eight-bit levels (E) 1230b-1230d. M- ary pyramid architecture 1240 comprises four (4) eight-bit levels (E) 1240a- 1240d. It should be noted that M-ary pyramid architecture 1240 is simply a mean pyramid.
In operation, for all four M-ary pyramid architectures, motion vectors for level 3 (1210d-1240d) are estimated using full search with "tiling block" sizes of 8 x 8 (710) and 4 x 4 (720), i.e., multi-scale tiling as illustrated in FIG. 7. Multi-scale (or N-scale) tiling is the process of performing motion estimation for a current block of the frame using different "tiling block" sizes. For example, if N is set to three, then three (3) motion vectors are generated for each block within each frame, i.e., the block is "tiled" with three different block sizes or scales. In turn, the motion vectors for level 3 are propagated to level 2 and refined with block sizes of 8 x 8 and 4 x 4. The motion vectors from level 2 are propagated to level 1 and refined with a block size of 8 x 8. The motion vectors from level 1 are propagated to level 0 and refined with a block size of 16 x 16. However, the present invention is not limited to a particular number of blocks and block sizes. In fact, any number of blocks and/or block sizes can be implemented with the present invention. For example, N-scale tiling can be implemented in combination with the present invention as disclosed in an accompanying patent application filed simultaneously herewith on 29 June 1998 with the title "Apparatus And Method For Employing M-ary Pyramids With N-Scale Tiling" (attorney docket SAR 12455; serial number 09/106,707), hereby incorporated by reference.
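The level-to-level propagation described above can be sketched as follows; `refine` stands in for whatever block-matching search is used at the finer level (e.g., a full search over a small window), and all names are illustrative:

```python
def propagate_and_refine(mv_coarse, refine, search_range=3):
    """Hierarchical propagation sketch: a motion vector estimated at
    level l+1 is doubled (pixel coordinates scale by 2 per level) and
    used to seed a small refinement search at level l. `refine` is a
    caller-supplied routine returning the best (dx, dy) offset within
    +/-search_range pixels of the seed."""
    seed = (2 * mv_coarse[0], 2 * mv_coarse[1])
    dx, dy = refine(seed, search_range)
    return (seed[0] + dx, seed[1] + dy)
```

Only the coarsest level pays for a wide full search; every finer level searches just a few pixels around the doubled seed, which is what makes the hierarchy cheap.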
Scalable hierarchical motion estimation is achieved by changing an O level into an E level during the hierarchical motion estimation process. It should be noted that once the M-ary pyramid architecture 1210 is generated, the necessary levels for the other M-ary pyramid architectures 1220-1240 are available. For example, the E level 1220c (level 2 of the M-ary pyramid) is simply level 2 of a mean pyramid that was previously generated to compute the binary level 1210c. Similarly, the E level 1230b (level 1 of the M-ary pyramid) is simply level 1 of a mean pyramid that was previously generated to compute the binary level 1210b and so on. Thus, the entire mean pyramid that was generated to derive the M-ary pyramid 1210 is stored in a location, e.g., in the memory of a computer system, for later use. In this fashion, four (4) hierarchical motion vector estimation architectures are obtained, which are HME3B, HME2B, HME1B, and HME0B, to provide a scalable hierarchical motion estimation process. The labels HME3B, HME2B, HME1B, and HME0B refer to hierarchical motion estimation with 3 O layers, 2 O layers, 1 O layer, and 0 O layers, respectively, as shown in FIG. 12.
FIG. 11 illustrates a flowchart of a method 1100 for performing a scalable hierarchical motion estimation on an M-ary pyramid. More specifically, method 1100 starts in step 1105 and proceeds to step 1110 where an initial M-ary pyramid architecture (e.g., the binary pyramid (M=2) of equations 6 and 7) is selected for a frame in the image sequence. In step 1115, method 1100 queries whether the current M-ary pyramid architecture should be changed. The decision to change a particular M-ary pyramid architecture can be based on one or more criteria such as computational complexity, the available memory resources, memory bandwidth, user's choice and/or the available communication bandwidth. If the query is answered negatively, then method 1100 proceeds to step 1125 where an M-ary pyramid is generated in accordance with the selected M-ary pyramid architecture. If the query is answered positively, then method 1100 proceeds to step 1120, where a new M-ary pyramid architecture is selected for the current frame (e.g., changing from architecture 1210 to 1220) and then proceeds to step 1125.
In step 1130, method 1100 performs hierarchical motion estimation starting from the highest level of the M-ary pyramid. Once motion vectors are generated for the highest level, the motion vectors are passed to a lower level of the M-ary pyramid as discussed above. In step 1135, method 1100 queries whether the current M-ary pyramid architecture should be changed for a next level of the M-ary pyramid architecture. Namely, method 1100 can switch to a different M-ary pyramid architecture during the hierarchical motion estimation process. Again, the decision to change a particular M-ary pyramid architecture level can be based on different criteria such as computational complexity, the available memory resources, user's choice and/or the available communication bandwidth. If the query is answered negatively, then method 1100 proceeds to step 1145. If the query is answered positively, then method 1100 proceeds to step 1140, where a new M-ary pyramid architecture (or simply a new level, e.g., substituting an O level with an E level) is selected for the current frame and then proceeds to step 1145.
In step 1145, method 1100 queries whether there is a next level for the current M-ary pyramid architecture. If the query is answered negatively, then method 1100 proceeds to step 1150. If the query is answered positively, then method 1100 returns to step 1130, where hierarchical motion estimation is performed on the next level of the M-ary pyramid architecture. In step 1150, method 1100 queries whether there is a next frame in the image sequence. If the query is answered negatively, then method 1100 ends in step 1155. If the query is answered positively, then method 1100 returns to step 1115, where hierarchical motion estimation is performed for the next frame in the image sequence.
The present hierarchical motion estimation using a binary pyramid with four levels and four different binary pyramid architectures provides a scalable motion estimation method. To illustrate, assume the width and the height of the video image are W and H, respectively. The frame rate of the video sequence is Fr. Assume that the size of the image block is N × N. A picture frame contains H/N picture slices, and there are W/N blocks in each slice. The search window is ±M pixels.
In a block matching motion estimation method, search areas of adjacent blocks may overlap. This overlapped area data can be stored inside an on-chip buffer to reduce external memory bandwidth. Assume a buffer "D" whose size equals the search area, (N + 2M) × (N + 2M) bytes. The new loading data size for buffer D is N × (N + 2M) bytes when the next block is on the same picture slice.
The complete buffer is loaded at the beginning of a slice when processing one picture slice. Thus, the total external memory bandwidth per slice is approximately (N + 2M)² + ((W/N) − 1) × N × (2M + N) bytes if boundary block cases are neglected. A derivation of the memory bandwidth requirement for the HME0B for a 720 × 480 image is given in the following.
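A quick sketch of the per-slice bandwidth estimate above (the function name is illustrative):

```python
def slice_bandwidth_bytes(W, N, M):
    """Approximate bytes loaded per picture slice, neglecting boundary
    blocks: one full (N + 2M)^2 search-area load at the start of the
    slice, then N * (2M + N) new bytes for each of the remaining
    W/N - 1 blocks on the slice."""
    return (N + 2 * M) ** 2 + (W // N - 1) * N * (2 * M + N)
```

For example, a 720-pixel-wide image with 16 × 16 blocks and a ±3-pixel search window loads 22² + 44 × 16 × 22 = 15,972 bytes per slice; multiplying by the number of slices (480/16 = 30) gives the per-frame figure for that level.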
The search range at level 3 is set to ±16 pixels. The search range at level 0, level 1, and level 2 is set to ±3 pixels. At level 3, the memory bandwidth (bytes) is approximated as:

MB_3 ≈ (60/4) × [(4 + 32)² + ((90/4) − 1) × 4 × (32 + 4)] + (60/8) × [(8 + 32)² + ((90/8) − 1) × 8 × (32 + 8)]   (8)
At level 2, the memory bandwidth (bytes) is approximated as:

MB_2 ≈ (120/4) × [(4 + 6)² + ((180/4) − 1) × 4 × (6 + 4)] + (120/8) × [(8 + 6)² + ((180/8) − 1) × 8 × (6 + 8)]   (9)
At level 1, the memory bandwidth (bytes) is approximated as:

MB_1 ≈ (240/8) × [(8 + 6)² + ((360/8) − 1) × 8 × (6 + 8)]   (10)
At level 0, the memory bandwidth (bytes) is approximated as:

MB_0 ≈ (480/16) × [(16 + 6)² + ((720/16) − 1) × 16 × (6 + 16)]   (11)
Therefore, the memory bandwidth (bytes/s) of the HME0B is approximated as:

MB_HME0B ≈ Fr × (MB_0 + MB_1 + MB_2 + MB_3)   (12)

The memory bandwidth requirement for the HME1B, the HME2B, and the HME3B can be derived in the same manner as the above derivation for the HME0B. Table 1 lists memory bandwidth requirements in Mbytes/s for the four binary pyramid architectures.
[Table 1, rendered as an image in the original (imgf000024_0001): memory bandwidth requirements in Mbytes/s for the four binary pyramid architectures, ranging from 6.341 Mbytes/s (HME3B) to 29.208 Mbytes/s (HME0B).]
It can be seen from Table 1 that the memory bandwidth requirement of the present invention is scalable from 6.341 Mbytes/s to 29.208 Mbytes/s as the O layer changes into an E layer.
It should be noted that although the present invention employs a binary pyramid with four levels, the present invention is not so limited. In fact, other M-ary pyramids can be implemented with different levels. Furthermore, the above block classification (high activity and low activity) method can be used to select the levels or the types of the M-ary pyramid architecture to be used. For example, the type of the M-ary pyramid architecture to be selected can be based on the "activities" (high or low) in a particular frame. In fact, any other block classification methods can be used in conjunction with the present invention.
FIG. 8 depicts a wavelet-based encoder 800 that incorporates the present invention. The encoder contains a block motion compensator (BMC) and motion vector coder 804, subtractor 802, discrete wavelet transform (DWT) coder 806, bit rate controller 810, DWT decoder 812 and output buffer 814.
In general, as discussed above the input signal is a video image (a two-dimensional array of pixels (pels) defining a frame in a video sequence). To accurately transmit the image through a low bit rate channel, the spatial and temporal redundancy in the video frame sequence must be substantially reduced. This is generally accomplished by coding and transmitting only the differences between successive frames. The encoder has three functions: first, it produces, using the BMC and its coder 804, a plurality of motion vectors that represent motion that occurs between frames; second, it predicts the present frame using a reconstructed version of the previous frame combined with the motion vectors; and third, the predicted frame is subtracted from the present frame to produce a frame of residuals that are coded and transmitted along with the motion vectors to a receiver.
The discrete wavelet transform performs a wavelet hierarchical subband decomposition to produce a conventional wavelet tree representation of the input image. To accomplish such image decomposition, the image is decomposed using times two subsampling into high horizontal-high vertical (HH), high horizontal-low vertical (HL), low horizontal-high vertical (LH), and low horizontal-low vertical (LL) frequency subbands. The LL subband is then further subsampled times two to produce a set of HH, HL, LH and LL subbands. This subsampling is accomplished recursively to produce an array of subbands such as that illustrated in FIG. 9 where three subsamplings have been used. Preferably six subsamplings are used in practice. The parent-child dependencies between subbands are illustrated as arrows pointing from the subband of the parent nodes to the subbands of the child nodes. The lowest frequency subband is the top left LL1, and the highest frequency subband is at the bottom right HH3. In this example, all child nodes have one parent. A detailed discussion of subband decomposition is presented in J.M. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients", IEEE Trans. on Signal Processing, Vol. 41, No. 12, pp. 3445-62, December 1993. The DWT coder of FIG. 8 codes the coefficients of the wavelet tree in either a "breadth first" or "depth first" pattern. A breadth first pattern traverses the wavelet tree in a bit-plane by bit-plane pattern, i.e., quantize all parent nodes, then all children, then all grandchildren and so on. In contrast, a depth first pattern traverses each tree from the root in the low-low subband (LL1) through the children (top down) or children through the low-low subband
(bottom up). The selection of the proper quantization level by the rate controller 810 is as discussed above to control the bit rate for each macroblock within each frame of a sequence. As such, the present invention can be adapted to various types of encoders that use different transforms. FIG. 6 illustrates an encoding system 600 of the present invention. The encoding system comprises a general purpose computer 610 and various input/output devices 620. The general purpose computer comprises a central processing unit (CPU) 612, a memory 614 and an encoder 616 for receiving and encoding a sequence of images.
In the preferred embodiment, the encoder 616 is simply the encoder 100 or 800 as discussed above. The encoder 616 can be a physical device which is coupled to the CPU 612 through a communication channel. Alternatively, the encoder 616 can be represented by a software application (or a combination of software and hardware, e.g., application specific integrated circuits (ASIC)), which is loaded from a storage device, e.g., a magnetic or optical disk, and resides in the memory 614 of the computer. As such, the encoder 100 of the present invention can be stored on a computer readable medium.
The computer 610 can be coupled to a plurality of input and output devices 620, such as a keyboard, a mouse, a camera, a camcorder, a video monitor, any number of imaging devices or storage devices, including but not limited to, a tape drive, a floppy drive, a hard disk drive or a compact disk drive. The input devices serve to provide inputs to the computer for producing the encoded video bitstreams or to receive the sequence of video images from a storage device or an imaging device. Finally, a communication channel 630 is shown where the encoded signal from the encoding system is forwarded to a decoding system (not shown). Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings.

Claims

What is claimed is:
1. A method for decomposing an image in an image sequence into an M-ary pyramid, said method comprising the steps of: (a) generating an image pyramid having a plurality of levels from the image; and
(b) generating an M-ary pyramid having a plurality of P-bit levels from said image pyramid, where P for at least two of said P-bit levels of said M-ary pyramid is different.
2. The method of claim 1, wherein said P for at least two of said P-bit levels of said M-ary pyramid is one and eight.
3. The method of claim 1, further comprising the step of: (c) performing hierarchical motion estimation on said M-ary pyramid.
4. The method of claim 3, wherein said performing hierarchical motion estimation step (c) comprises the steps of:
(cl) generating a plurality of motion vectors starting from a highest level of said M-ary pyramid, where said plurality of motion vectors are passed hierarchically to a lower level of said M-ary pyramid;
(c2) determining at a next level of said M-ary pyramid if it is necessary to change said P to a different value; and
(c3) changing said P to a different value in accordance with said step (c2) and repeating said steps (cl) and (c2) until a plurality of motion vectors are generated for a lowest level of said M-ary pyramid.
5. The method of claim 4, wherein said motion vector generating step (cl) generates a plurality of motion vectors based on a plurality of tiling block sizes.
6. The method of claim 1, further comprising the step:
(c) selectively changing said P in response to a criterion.
7. A method for performing motion estimation for a sequence of images, where each of said images is divided into at least one block, said method comprising the steps of:
(a) generating a plurality of M-ary pyramids having different pyramid architectures, where each of said M-ary pyramid architectures comprises a plurality of P-bit levels, where P for at least two of said P-bit levels of at least one of said M-ary pyramid architecture is different;
(b) selecting one of said M-ary pyramid architectures for performing hierarchical motion estimation; and (c) generating a plurality of motion vectors starting from a highest level of said selected M-ary pyramid, where said plurality of motion vectors are passed hierarchically to a lower level of said selected M-ary pyramid.
8. The method of claim 7, wherein said M-ary pyramid generating step (a) comprises the steps of:
(al) generating a mean pyramid for the image; and
(a2) generating said plurality of M-ary pyramids from said mean pyramid.
9. An apparatus for decomposing an image in an image sequence into an M-ary pyramid, said apparatus comprises: an image pyramid generator for generating an image pyramid having a plurality of levels from the image; and an M-ary pyramid generator for generating an M-ary pyramid having a plurality of P-bit levels from said image pyramid, where P for at least two of said P-bit levels of said M-ary pyramid is different.
10. Apparatus for encoding an image sequence having at least one input frame, said apparatus comprising: a motion compensator for generating a predicted image of a current input frame, said motion compensator comprises an M-ary pyramid generator for generating an M-ary pyramid having a plurality of P-bit levels, where P for at least two of said P-bit levels of said M-ary pyramid is different, and a motion estimation module for performing hierarchical motion estimation on said M-ary pyramid; a transform module for applying a transformation to a difference signal between the input frame and said predicted image, where said transformation produces a plurality of coefficients; a quantizer for quantizing said plurality of coefficients with at least one quantizer scale to produce a plurality of quantized coefficients; and a coder for coding said quantized coefficients into a bitstream.
PCT/US1998/027543 1997-12-31 1998-12-31 Apparatus and method for performing scalable hierarchical motion estimation WO1999034331A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2000526903A JP2002500402A (en) 1997-12-31 1998-12-31 Apparatus and method for scalable hierarchical motion estimation
KR1020007007344A KR20010033797A (en) 1997-12-31 1998-12-31 Apparatus and method for performing scalable hierarchical motion estimation
EP98964302A EP1042734A1 (en) 1997-12-31 1998-12-31 Apparatus and method for performing scalable hierarchical motion estimation
AU19470/99A AU1947099A (en) 1997-12-31 1998-12-31 Apparatus and method for performing scalable hierarchical motion estimation

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US09/002,258 US6408101B1 (en) 1997-12-31 1997-12-31 Apparatus and method for employing M-ary pyramids to enhance feature-based classification and motion estimation
US09/002,258 1997-12-31
US09/106,706 US6208692B1 (en) 1997-12-31 1998-06-29 Apparatus and method for performing scalable hierarchical motion estimation
US09/106,706 1998-06-29

Publications (1)

Publication Number Publication Date
WO1999034331A1 true WO1999034331A1 (en) 1999-07-08

Family

ID=26670143

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/027543 WO1999034331A1 (en) 1997-12-31 1998-12-31 Apparatus and method for performing scalable hierarchical motion estimation

Country Status (7)

Country Link
US (1) US6208692B1 (en)
EP (1) EP1042734A1 (en)
JP (1) JP2002500402A (en)
KR (1) KR20010033797A (en)
CN (1) CN1215439C (en)
AU (1) AU1947099A (en)
WO (1) WO1999034331A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002093936A1 (en) 2001-05-10 2002-11-21 Sony Corporation Moving picture encoding apparatus

Families Citing this family (60)

Publication number Priority date Publication date Assignee Title
US6430317B1 (en) * 1997-12-31 2002-08-06 Sarnoff Corporation Method and apparatus for estimating motion using block features obtained from an M-ary pyramid
AU762996B2 (en) * 1999-02-09 2003-07-10 Motorola Australia Pty Ltd An image compression system and method of determining quantisation parameters therefor
US6633610B2 (en) * 1999-09-27 2003-10-14 Intel Corporation Video motion estimation
US6442203B1 (en) * 1999-11-05 2002-08-27 Demografx System and method for motion compensation and frame rate conversion
WO2001045425A1 (en) * 1999-12-14 2001-06-21 Scientific-Atlanta, Inc. System and method for adaptive decoding of a video signal with coordinated resource allocation
US6594397B1 (en) * 2000-03-03 2003-07-15 Tektronix, Inc. Adaptive multi-modal motion estimation for video compression
WO2002019721A2 (en) * 2000-08-28 2002-03-07 Thomson Licensing S.A. Method and apparatus for motion compensated temporal interpolation of video sequences
EP1320831A2 (en) * 2000-09-12 2003-06-25 Koninklijke Philips Electronics N.V. Video coding method
KR100407691B1 (en) * 2000-12-21 2003-12-01 한국전자통신연구원 Effective Motion Estimation for hierarchical Search
US6873655B2 (en) 2001-01-09 2005-03-29 Thomson Licensing A.A. Codec system and method for spatially scalable video data
WO2002080573A1 (en) * 2001-03-28 2002-10-10 Sony Corporation Quantization apparatus, quantization method, quantization program, and recording medium
US20030206181A1 (en) * 2001-04-13 2003-11-06 Abb Ab System and method for organizing two and three dimensional image data
DE10120395A1 (en) * 2001-04-25 2002-10-31 Bosch Gmbh Robert Device for the interpolation of samples as well as image encoder and image decoder
KR100408294B1 (en) * 2001-09-05 2003-12-01 삼성전자주식회사 Method adapted for low bit-rate moving picture coding
KR100451584B1 (en) * 2001-12-20 2004-10-08 엘지전자 주식회사 Device for encoding and decoding a moving picture using of a wavelet transformation and a motion estimation
US7274857B2 (en) * 2001-12-31 2007-09-25 Scientific-Atlanta, Inc. Trick modes for compressed video streams
US7266151B2 (en) * 2002-09-04 2007-09-04 Intel Corporation Method and system for performing motion estimation using logarithmic search
US20040042551A1 (en) * 2002-09-04 2004-03-04 Tinku Acharya Motion estimation
US20040057626A1 (en) * 2002-09-23 2004-03-25 Tinku Acharya Motion estimation using a context adaptive search
US20040066849A1 (en) * 2002-10-04 2004-04-08 Koninklijke Philips Electronics N.V. Method and system for significance-based embedded motion-compensation wavelet video coding and transmission
US7558441B2 (en) * 2002-10-24 2009-07-07 Canon Kabushiki Kaisha Resolution conversion upon hierarchical coding and decoding
US7020201B2 (en) * 2002-11-20 2006-03-28 National Chiao Tung University Method and apparatus for motion estimation with all binary representation
US7966642B2 (en) * 2003-09-15 2011-06-21 Nair Ajith N Resource-adaptive management of video storage
US8600217B2 (en) * 2004-07-14 2013-12-03 Arturo A. Rodriguez System and method for improving quality of displayed picture during trick modes
CN101023662B (en) * 2004-07-20 2010-08-04 高通股份有限公司 Method and apparatus for motion vector processing
DE102004059978B4 (en) * 2004-10-15 2006-09-07 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for generating a coded video sequence and decoding a coded video sequence using interlayer residue prediction, and a computer program and computer readable medium
US20060146932A1 (en) * 2004-12-30 2006-07-06 Krit Panusopone Method and apparatus for providing motion estimation with weight prediction
US20060233258A1 (en) * 2005-04-15 2006-10-19 Microsoft Corporation Scalable motion estimation
US9667999B2 (en) * 2005-04-25 2017-05-30 Avago Technologies General Ip (Singapore) Pte. Ltd. Method and system for encoding video data
TWI279143B (en) * 2005-07-11 2007-04-11 Softfoundry Internat Ptd Ltd Integrated compensation method of video code flow
US20070064805A1 (en) * 2005-09-16 2007-03-22 Sony Corporation Motion vector selection
US8005308B2 (en) * 2005-09-16 2011-08-23 Sony Corporation Adaptive motion estimation for temporal prediction filter over irregular motion vector samples
US7894527B2 (en) * 2005-09-16 2011-02-22 Sony Corporation Multi-stage linked process for adaptive motion vector sampling in video compression
US8165205B2 (en) * 2005-09-16 2012-04-24 Sony Corporation Natural shaped regions for motion compensation
US7596243B2 (en) * 2005-09-16 2009-09-29 Sony Corporation Extracting a moving object boundary
US7957466B2 (en) * 2005-09-16 2011-06-07 Sony Corporation Adaptive area of influence filter for moving object boundaries
US7885335B2 (en) * 2005-09-16 2011-02-08 Sont Corporation Variable shape motion estimation in video sequence
US7894522B2 (en) * 2005-09-16 2011-02-22 Sony Corporation Classified filtering for temporal prediction
US7620108B2 (en) * 2005-09-16 2009-11-17 Sony Corporation Integrated spatial-temporal prediction
US8059719B2 (en) * 2005-09-16 2011-11-15 Sony Corporation Adaptive area of influence filter
US8107748B2 (en) * 2005-09-16 2012-01-31 Sony Corporation Adaptive motion search range
US8494052B2 (en) * 2006-04-07 2013-07-23 Microsoft Corporation Dynamic selection of motion estimation search ranges and extended motion vector ranges
US8155195B2 (en) * 2006-04-07 2012-04-10 Microsoft Corporation Switching distortion metrics during motion estimation
US20070268964A1 (en) * 2006-05-22 2007-11-22 Microsoft Corporation Unit co-location-based motion estimation
JP4424522B2 (en) * 2006-07-13 2010-03-03 日本電気株式会社 Encoding and decoding apparatus, encoding method and decoding method
CN101617538A (en) * 2007-01-08 2009-12-30 诺基亚公司 The improvement inter-layer prediction that is used for the video coding extended spatial scalability
CN101252659B (en) * 2007-02-25 2010-08-04 青岛海信电器股份有限公司 Circuit and method for managing standby and complete machine powering of television set
CN100461218C (en) * 2007-03-29 2009-02-11 杭州电子科技大学 Method for enhancing medical image with multi-scale self-adaptive contrast change
US20090033791A1 (en) * 2007-07-31 2009-02-05 Scientific-Atlanta, Inc. Video processing systems and methods
US8605786B2 (en) * 2007-09-04 2013-12-10 The Regents Of The University Of California Hierarchical motion vector processing method, software and devices
US8120659B2 (en) * 2008-05-22 2012-02-21 Aptina Imaging Corporation Method and system for motion estimation in digital imaging applications
US8300696B2 (en) * 2008-07-25 2012-10-30 Cisco Technology, Inc. Transcoding for systems operating under plural video coding specifications
WO2010103849A1 (en) * 2009-03-13 2010-09-16 日本電気株式会社 Image identifier extraction device
US8508659B2 (en) * 2009-08-26 2013-08-13 Nxp B.V. System and method for frame rate conversion using multi-resolution temporal interpolation
US8588309B2 (en) * 2010-04-07 2013-11-19 Apple Inc. Skin tone and feature detection for video conferencing compression
MX352139B (en) * 2011-03-10 2017-11-10 Velos Media Int Ltd A method for decoding video.
US11245912B2 (en) 2011-07-12 2022-02-08 Texas Instruments Incorporated Fast motion estimation for hierarchical coding structures
KR20130050149A (en) * 2011-11-07 2013-05-15 오수미 Method for generating prediction block in inter prediction mode
US9998750B2 (en) 2013-03-15 2018-06-12 Cisco Technology, Inc. Systems and methods for guided conversion of video from a first to a second compression format
KR101811718B1 (en) 2013-05-31 2018-01-25 삼성전자주식회사 Method and apparatus for processing the image

Citations (1)

Publication number Priority date Publication date Assignee Title
EP0785688A2 (en) * 1995-12-27 1997-07-23 Sony Corporation Hierarichal encoding of videosignals

Family Cites Families (11)

Publication number Priority date Publication date Assignee Title
US5315670A (en) 1991-11-12 1994-05-24 General Electric Company Digital data compression system including zerotree coefficient coding
US5253058A (en) * 1992-04-01 1993-10-12 Bell Communications Research, Inc. Efficient coding scheme for multilevel video transmission
US5337085A (en) 1992-04-10 1994-08-09 Comsat Corporation Coding technique for high definition television signals
US5412741A (en) * 1993-01-22 1995-05-02 David Sarnoff Research Center, Inc. Apparatus and method for compressing information
GB2286740B (en) * 1994-02-21 1998-04-01 Sony Uk Ltd Coding and decoding of video signals
KR0159434B1 (en) * 1995-04-19 1999-01-15 김광호 Wavelet image encoding and decoding device and method using human visual system modeling
US5757668A (en) * 1995-05-24 1998-05-26 Motorola Inc. Device, method and digital video encoder of complexity scalable block-matching motion estimation utilizing adaptive threshold termination
US5982434A (en) * 1996-03-22 1999-11-09 Sony Corporation Image signal coding method and device thereof, image signal decoding method and device thereof, and recording medium
US5867221A (en) * 1996-03-29 1999-02-02 Interated Systems, Inc. Method and system for the fractal compression of data using an integrated circuit for discrete cosine transform compression/decompression
US5973742A (en) * 1996-05-24 1999-10-26 Lsi Logic Corporation System and method for performing motion estimation with reduced memory loading latency
US5909518A (en) * 1996-11-27 1999-06-01 Teralogic, Inc. System and method for performing wavelet-like and inverse wavelet-like transformations of digital data

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
EP0785688A2 (en) * 1995-12-27 1997-07-23 Sony Corporation Hierarichal encoding of videosignals

Non-Patent Citations (4)

Title
GOH W B ET AL: "MODEL-BASED MULTIRESOLUTION MOTION ESTIMATION IN NOISY IMAGES", CVGIP IMAGE UNDERSTANDING, vol. 59, no. 3, 1 May 1994 (1994-05-01), pages 307 - 319, XP000454805 *
LEE X ET AL: "A FAST HIERARCHICAL MOTION-COMPENSATION SCHEME FOR VIDEO CODING USING BLOCK FEATURE MATCHING", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 6, no. 6, 1 December 1996 (1996-12-01), pages 627 - 635, XP000641035 *
MANDAL M KR ET AL: "MULTIRESOLUTION MOTION ESTIMATION TECHNIQUES FOR VIDEO COMPRESSION", OPTICAL ENGINEERING, vol. 35, no. 1, 1 January 1996 (1996-01-01), pages 128 - 135, XP000631420 *
XUDONG SONG ET AL: "A scalable hierarchical motion estimation algorithm for MPEG-2", PROC. OF THE 1998 IEEE INTERNAT. SYMP. ON CIRCUITS AND SYSTEMS (CAT. NO.98CH36187), ISCAS '98, CA, USA, 31 MAY-3 JUNE 1998, vol. 4, ISBN 0-7803-4455-3, 1998, New York, NY, USA, IEEE, USA, pages 126 - 129, XP002101715 *

Cited By (3)

Publication number Priority date Publication date Assignee Title
WO2002093936A1 (en) 2001-05-10 2002-11-21 Sony Corporation Moving picture encoding apparatus
EP1387586A1 (en) * 2001-05-10 2004-02-04 Sony Corporation Moving picture encoding apparatus
EP1387586A4 (en) * 2001-05-10 2009-01-07 Sony Corp Moving picture encoding apparatus

Also Published As

Publication number Publication date
US6208692B1 (en) 2001-03-27
KR20010033797A (en) 2001-04-25
AU1947099A (en) 1999-07-19
CN1283291A (en) 2001-02-07
JP2002500402A (en) 2002-01-08
EP1042734A1 (en) 2000-10-11
CN1215439C (en) 2005-08-17

Similar Documents

Publication Publication Date Title
US6208692B1 (en) Apparatus and method for performing scalable hierarchical motion estimation
US6560371B1 (en) Apparatus and method for employing M-ary pyramids with N-scale tiling
EP1138152B8 (en) Method and apparatus for performing hierarchical motion estimation using nonlinear pyramid
US6084908A (en) Apparatus and method for quadtree based variable block size motion estimation
US6895050B2 (en) Apparatus and method for allocating bits temporally between frames in a coding system
US6430317B1 (en) Method and apparatus for estimating motion using block features obtained from an M-ary pyramid
US6690833B1 (en) Apparatus and method for macroblock based rate control in a coding system
CA2295689C (en) Apparatus and method for object based rate control in a coding system
US6434196B1 (en) Method and apparatus for encoding video information
JP2000511366A6 (en) Apparatus and method for variable block size motion estimation based on quadtree
WO2000000932A1 (en) Method and apparatus for block classification and adaptive bit allocation
KR20020026254A (en) Color video encoding and decoding method
US6553071B1 (en) Motion compensation coding apparatus using wavelet transformation and method thereof
US6408101B1 (en) Apparatus and method for employing M-ary pyramids to enhance feature-based classification and motion estimation
WO2003081918A1 (en) Video codec with hierarchical motion estimation in the wavelet domain
US6532265B1 (en) Method and system for video compression
AU2001293994B2 (en) Compression of motion vectors
KR100987581B1 (en) Method of Partial Block Matching for Fast Motion Estimation
Lee et al. Subband video coding with scene-adaptive hierarchical motion estimation
Anandan et al. VIDEO COMPRESSION USING ARPS AND FAST DISCRETE CURVELET TRANSFORM FOR MOBILE DEVICES
Van Der Auwera et al. A new technique for motion estimation and compensation of the wavelet detail images
Kim et al. Video coding with wavelet transform on the very-low-bit-rate communication channel

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 98812628.1

Country of ref document: CN

AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase

Ref document number: 2000 526903

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 1020007007344

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 1998964302

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1998964302

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP Wipo information: published in national office

Ref document number: 1020007007344

Country of ref document: KR

WWW Wipo information: withdrawn in national office

Ref document number: 1998964302

Country of ref document: EP

WWR Wipo information: refused in national office

Ref document number: 1020007007344

Country of ref document: KR