|Publication number||USRE43360 E1|
|Application number||US 12/618,202|
|Publication date||May 8, 2012|
|Filing date||Nov 13, 2009|
|Priority date||Jan 30, 1996|
|Also published as||US6728317, US6957350, US7366242, US7428639, US20040196901, US20050254649, USRE44235, USRE45082|
|Publication number||12618202, 618202, US RE43360 E1, US RE43360E1, US-E1-RE43360, USRE43360 E1, USRE43360E1|
|Inventors||Gary A. Demos|
|Original Assignee||Dolby Laboratories Licensing Corporation|
This application is a divisional of and claims priority to U.S. application Ser. No. 09/545,233 filed on Apr. 7, 2000 now U.S. Pat. No. 6,728,317 (which is incorporated herein in its entirety), which was a continuation-in-part application of U.S. application Ser. No. 09/442,595 filed on Nov. 17, 1999 now abandoned, which was a continuation of U.S. application Ser. No. 09/217,151 filed on Dec. 21, 1998 (now U.S. Pat. No. 5,988,863, issued Nov. 23, 1999), which was a continuation of U.S. application Ser. No. 08/594,815 filed Jan. 30, 1996 (now U.S. Pat. No. 5,852,565, issued Dec. 22, 1998).
This invention relates to electronic communication systems, and more particularly to an advanced electronic television system having enhanced compression, filtering, and display characteristics.
The United States presently uses the NTSC standard for television transmissions. However, proposals have been made to replace the NTSC standard with an Advanced Television standard. For example, it has been proposed that the U.S. adopt digital standard-definition and advanced television formats at rates of 24 Hz, 30 Hz, 60 Hz, and 60 Hz interlaced. It is apparent that these rates are intended to continue (and thus be compatible with) the existing NTSC television display rate of 60 Hz (or 59.94 Hz). It is also apparent that “3-2 pulldown” is intended for display on 60 Hz displays when presenting movies, which have a temporal rate of 24 frames per second (fps). However, while the above proposal provides a menu of possible formats from which to select, each format only encodes and decodes a single resolution and frame rate. Because the display or motion rates of these formats are not integrally related to each other, conversion from one to another is difficult.
Further, this proposal does not provide a crucial capability of compatibility with computer displays. These proposed image motion rates are based upon historical rates which date back to the early part of this century. If a “clean-slate” were to be made, it is unlikely that these rates would be chosen. In the computer industry, where displays could utilize any rate over the last decade, rates in the 70 to 80 Hz range have proven optimal, with 72 and 75 Hz being the most common rates. Unfortunately, the proposed rates of 30 and 60 Hz lack useful interoperability with 72 or 75 Hz, resulting in degraded temporal performance.
In addition, it is being suggested by some that interlace is required, due to a claimed need to have about 1000 lines of resolution at high frame rates, but based upon the notion that such images cannot be compressed within the available 18-19 Mbits/second of a conventional 6 MHz broadcast television channel.
It would be much more desirable if a single signal format were to be adopted, containing within it all of the desired standard and high definition resolutions. However, to do so within the bandwidth constraints of a conventional 6 MHz broadcast television channel requires compression and “scalability” of both frame rate (temporal) and resolution (spatial). One method specifically intended to provide for such scalability is the MPEG-2 standard. Unfortunately, the temporal and spatial scalability features specified within the MPEG-2 standard (and newer standards, like MPEG-4) are not sufficiently efficient to accommodate the needs of advanced television for the U.S. Thus, the proposal for advanced television for the U.S. is based upon the premise that temporal (frame rate) and spatial (resolution) layering are inefficient, and therefore discrete formats are necessary.
Further, it would be desirable to provide enhancements to resolution, image clarity, coding efficiency, and video production efficiency. The present invention provides such enhancements.
The invention provides a number of enhancements to handle a variety of video quality and compression problems. The following describes a number of such enhancements, most of which are preferably embodied as a set of tools which can be applied to the tasks of enhancing images and compressing such images. The tools can be combined by a content developer in various ways, as desired, to optimize the visual quality and compression efficiency of a compressed data stream, particularly a layered compressed data stream.
Such tools include improved de-interlacing and noise reduction enhancements, including motion analysis.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
FIG. 5 is a diagram of the relative shape, amplitudes, and lobe polarity of a preferred downsizing filter.
FIGS. 6A and 6B are diagrams of the relative shape, amplitudes, and lobe polarity of a pair of preferred upsizing filters for upsizing by a factor of 2.
Like reference symbols in the various drawings indicate like elements.
Throughout this description, the preferred embodiment and examples shown should be considered as exemplars, rather than as limitations on the invention.
A number of enhancements may be made to handle a variety of video quality and compression problems. The following describes a number of such enhancements, most of which are preferably embodied as a set of tools which can be applied to the tasks of enhancing images and compressing such images. The tools can be combined by a content developer in various ways, as desired, to optimize the visual quality and compression efficiency of a compressed data stream, particularly a layered compressed data stream.
Experimentation has shown that many de-interlacing algorithms and devices depend upon the human eye to integrate fields to create an acceptable result. However, since compression algorithms are not a human eye, any integration of de-interlaced fields should take into account the characteristics of such algorithms. Without such careful de-interlaced integration, the compression process will create high levels of noise artifacts, both wasting bits (hindering compression) and making the image look noisy and busy with artifacts. This distinction between de-interlacing for viewing (such as with line-doublers and line-quadruplers) vs. de-interlacing as input to compression has led to the techniques described below. In particular, the de-interlacing techniques described below are useful as input to single-layer non-interlaced MPEG-like compression, as well as to layered MPEG-like compression.
Further, noise reduction must similarly match the needs of being an input to compression algorithms, rather than just reducing noise appearance. The goal is generally to reproduce, upon decompression, no more noise than the original camera or film-grain noise. Equal noise is generally considered acceptable, after compression/decompression. Reduced noise, with equivalent sharpness and clarity with the original, is a bonus. The noise reduction described below achieves these goals.
Further, for very noisy shots, such as from high speed film or with high camera sensitivity settings, usually in low light, noise reduction can be the difference between a good looking compressed/decompressed image vs. one which is unwatchably noisy. The compression process greatly amplifies noise which is above some threshold of acceptability to the compressor. Thus, the use of noise-reduction pre-processing to keep noise below this threshold may be required for acceptably good quality results.
De-Graining and Noise-Reducing Filters
It has been found through experimentation that applying de-graining and/or noise-reducing filtering before layered or non-layered encoding improves the ability of the compression system to perform. While de-graining or noise-reduction is most effective on grainy or noisy images prior to compression, either process may be helpful when used in moderation even on relatively low noise or low grain pictures. Any of several known de-graining or noise-reduction algorithms may be applied. Examples are “coring”, simple neighbor median filters, and softening filters.
Whether noise-reduction is needed is determined by how noisy the original images are. For interlaced original images, the interlace itself is a form of noise, which usually will require additional noise reduction filtering, in addition to the complex de-interlacing process described below. For progressive scan (non-interlaced) camera or film images, noise processing is useful in layered and non-layered compression when noise is present above a certain level.
There are different types of noise. For example, video transfers from film include film grain noise. Film grain noise is caused by silver grains which couple to yellow, cyan, and magenta film dyes. Yellow affects both red and green, cyan affects both blue and green, and magenta affects both red and blue. Red is formed where yellow and magenta dye crystals overlap. Similarly green is the overlap of yellow and cyan, and blue is the overlap of magenta and cyan. Thus, noise between colors is partially correlated through the dyes and grains between pairs of colors. Further, when multiple grains overlap in all three colors, as they do on a print in dark regions of the image or on a negative in light regions of the image (dark on the negative), additional color combinations occur. This correlation between the colors can be utilized in film-grain noise reduction, but is a complex process. Further, many different film types are used, and each type has different grain sizes, shapes, and statistical distributions.
For video images created by CCD-sensor and other (e.g., tube) sensor cameras, the red, green, and blue noise is uncorrelated. In this case, it is best to process the red, green, and blue records independently. Thus, red noise is reduced with self-red processing independently of green noise and blue noise; the same approach applies to green and blue noise.
Thus, noise processing is best matched to the characteristics of the noise source itself. In the case of a composite image (from multiple sources), the noise may differ in characteristics over different portions of the image. In this situation, generic noise processing may be the only option, if noise processing is needed.
It has also been found useful in some cases to perform a “re-graining” or “re-noising” process after decoding a compressed layered data stream, as a creative effect, since some de-grained or de-noised images may be “too clean” or “too sterile” in appearance. Re-graining and/or re-noising are relatively easy effects to add in the decoder using any of several known algorithms. For example, this can be accomplished by the addition of low pass filtered random noise of suitable amplitude.
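By way of illustration only, the re-graining step described above (adding low pass filtered random noise of suitable amplitude) might be sketched as follows for a single row of pixel values. The function names, the moving-average low pass filter, the uniform noise distribution, and the default amplitude are all assumptions made for this sketch and are not taken from the description.

```python
import random

def box_blur_1d(samples, radius=1):
    """Low-pass filter a list of floats with a moving average (window shrinks at edges)."""
    n = len(samples)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n - 1, i + radius)
        window = samples[lo:hi + 1]
        out.append(sum(window) / len(window))
    return out

def regrain_row(row, amplitude=0.02, radius=1, seed=0):
    """Add low-pass filtered random noise to one row of pixel values in [0, 1].

    amplitude, radius, and seed are hypothetical tuning parameters.
    """
    rng = random.Random(seed)
    noise = [rng.uniform(-amplitude, amplitude) for _ in row]
    smooth = box_blur_1d(noise, radius)
    # Clamp so the result stays a valid pixel value.
    return [min(1.0, max(0.0, p + g)) for p, g in zip(row, smooth)]
```

A larger radius gives coarser, softer grain; the amplitude would in practice be chosen by eye as a creative setting.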
De-Interlacing Before Compression
As mentioned above, the preferred compression method for interlaced source which is ultimately intended for non-interlaced display includes a step to de-interlace the interlaced source before the compression steps. De-interlacing a signal after decoding in the receiver, where the signal has been compressed in the interlaced mode, is both more costly and less efficient than de-interlacing prior to compression, and then sending a non-interlaced compressed signal. The non-interlaced compressed signal can be either layered or non-layered (i.e., a conventional single layer compression).
Experimentation has shown that filtering a single field of an interlaced source, and using that field as if it were a non-interlaced full frame, gives poor and noisy compression results. Thus, using a single-field de-interlacer prior to compression is not a good approach. Instead, experimentation has shown that a three-field-frame de-interlacer process using field synthesized frames (“field-frames”), with weights of [0.25, 0.5, 0.25] for the previous, current, and next field-frames, respectively, provides a good input for compression. Combining three field-frames may be performed using other weights (although these weights are optimal) to create a de-interlaced input to a compression process.
In the preferred de-interlacing system, a field-de-interlacer is used as the first step in the overall process to create field-frames. In particular, each field is de-interlaced, creating a synthesized frame where the total number of lines in the frame is derived from the half number of lines in a field. Thus, for example, an interlaced 1080 line image will have 540 lines per even and odd field, each field representing 1/60th of a second. Normally, the even and odd fields of 540 lines will be interlaced to create 1080 lines for each frame, which represents 1/30th of a second. However, in the preferred embodiment, the de-interlacer copies each scanline without modification from a specified field (e.g., the odd fields) to a buffer that will hold some of the de-interlaced result. The remaining intermediate scanlines (in this example, the even scanlines) for the frame are synthesized by adding half of the field line above and half of the field line below each newly stored line. For example, the pixel values of line 2 for a frame would each comprise ½ of the summed corresponding pixel values from each of line 1 and line 3. The generation of intermediate synthesized scanlines may be done on the fly, or may be computed after all of the scanlines from a field are stored in a buffer. The same process is repeated for the next field, although the field types (i.e., even, odd) will be reversed.
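The field-de-interlacer step just described can be sketched as follows. This is a minimal illustration, assuming a field is held as a list of scanlines (each a list of numeric pixel values); the function name, the zero-based row numbering, and the edge handling (repeating the only available neighbor at the top or bottom of the frame) are assumptions of the sketch.

```python
def field_to_field_frame(field, top_field=True):
    """Expand one field into a full-height synthesized frame ("field-frame").

    field: list of scanlines, each a list of numbers.
    top_field: True if the field's lines occupy frame rows 0, 2, 4, ...
               (zero-based); False if they occupy rows 1, 3, 5, ...
    """
    height = 2 * len(field)
    frame = [None] * height
    offset = 0 if top_field else 1
    # Copy each real scanline without modification.
    for i, row in enumerate(field):
        frame[2 * i + offset] = list(row)
    # Synthesize each missing scanline as half the line above plus half the line below.
    for y in range(height):
        if frame[y] is not None:
            continue
        above = frame[y - 1] if y - 1 >= 0 else None
        below = frame[y + 1] if y + 1 < height else None
        if above is None:   # assumed edge rule: reuse the single neighbor
            above = below
        if below is None:
            below = above
        frame[y] = [0.5 * a + 0.5 * b for a, b in zip(above, below)]
    return frame
```

For example, a two-line field `[[0.0, 2.0], [4.0, 6.0]]` treated as a top field yields four lines, with the second line equal to the average of the two real lines.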
As a next step, a sequence of these de-interlaced fields is then used as input to a three-field-frame de-interlacer to create a final de-interlaced frame.
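As a minimal sketch of the three-field-frame step, the weighted sum of the previous, current, and next field-frames with weights [0.25, 0.5, 0.25] can be written as below. The function name and the list-of-rows image layout are assumptions of the sketch; for simplicity it operates directly on the stored pixel values.

```python
def three_field_frame(prev_ff, cur_ff, next_ff, weights=(0.25, 0.5, 0.25)):
    """Weighted temporal sum of three equally sized field-frames.

    Each field-frame is a list of rows of numeric pixel values.
    """
    wp, wc, wn = weights
    return [
        [wp * p + wc * c + wn * n for p, c, n in zip(row_p, row_c, row_n)]
        for row_p, row_c, row_n in zip(prev_ff, cur_ff, next_ff)
    ]
```

A pixel that is lit only in the current field-frame is reduced to half its value, which is the temporal smearing the text refers to.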
The new de-interlaced frame then contains far fewer interlace difference artifacts between frames than do the three field-frames of which it is composed. However, there is a temporal smearing by adding the previous field-frame and next field-frame into a current field-frame. This temporal smearing is usually not objectionable, especially in light of the de-interlacing improvements which result.
This de-interlacing process is very beneficial as input to compression, either single layer (unlayered) or layered. It is also beneficial just as a treatment for interlaced video for presentation, viewing, or making still frames, independent of use with compression. The picture from the de-interlacing process appears “clearer” than the presentation of the interlace directly, or of the de-interlaced fields.
Although the de-interlace three-field sum weightings of [0.25, 0.5, 0.25] discussed above provide a stable image, moving parts of a scene can sometimes become soft or can exhibit aliasing artifacts. To counteract this, a threshold test may be applied which compares the result of the [0.25, 0.5, 0.25] temporal filter against the corresponding pixel values of only the middle field-frame. If a middle field-frame pixel value differs more than a specified threshold amount from the value of the corresponding pixel from the three-field-frame temporal filter, then only the middle field-frame pixel value is used. In this way, a pixel from the three-field-frame temporal filter is selected where it differs less than the threshold amount from the corresponding pixel of the single de-interlaced middle field-frame, and the middle field-frame pixel value is used when there is more difference than the threshold. This allows fast motion to be tracked at the field rate, and smoother parts of the image to be filtered and smoothed by the three-field-frame temporal filter. This combination has proven an effective, if not optimal, input to compression. It is also very effective for processing for direct viewing to de-interlace image material (also called line doubling in conjunction with display).
The preferred embodiment for such threshold determinations uses the following equations for corresponding RGB color values from the middle (single) de-interlaced field-frame image and the three-field-frame de-interlaced image:
Rdiff=R_single_field_de-interlaced minus R_three_field_de-interlaced
Gdiff=G_single_field_de-interlaced minus G_three_field_de-interlaced
Bdiff=B_single_field_de-interlaced minus B_three_field_de-interlaced
The ThresholdingValue is then compared to a threshold setting. Typical threshold settings are in the range of 0.1 to 0.3, with 0.2 being most common.
In order to remove noise from this threshold, smooth-filtering the three-field-frame and single-field-frame de-interlaced pictures can be used before comparing and thresholding them. This smooth filtering can be accomplished simply by down filtering (e.g., down filtering by two), and then up filtering (e.g., using a gaussian up-filter by two). This “down-up” smoothed filter can be applied to both the single-field-frame de-interlaced picture and the three-field-frame de-interlaced picture. The smoothed single-field-frame and three-field-frame pictures can then be compared to compute a ThresholdingValue and then thresholded to determine which picture will source each final output pixel.
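The "down-up" smoothing can be illustrated on a single row of samples. In this sketch the down filter averages adjacent pairs and the up filter uses a simple 0.75/0.25 interpolation kernel; both kernels are assumptions standing in for whichever down filter and gaussian up-filter are actually used, and the function names are hypothetical.

```python
def down_by_two(row):
    """Halve the sample count by averaging adjacent pairs (assumes an even length)."""
    return [0.5 * (row[i] + row[i + 1]) for i in range(0, len(row) - 1, 2)]

def up_by_two(row):
    """Double the sample count with a 0.75/0.25 interpolation kernel (edges clamp)."""
    out = []
    n = len(row)
    for i, v in enumerate(row):
        left = row[max(i - 1, 0)]
        right = row[min(i + 1, n - 1)]
        out.append(0.75 * v + 0.25 * left)
        out.append(0.75 * v + 0.25 * right)
    return out

def down_up_smooth(row):
    """Smooth a row by down filtering by two and then up filtering by two."""
    return up_by_two(down_by_two(row))
```

Applied to an alternating row such as `[0, 1, 0, 1, 0, 1]`, the pixel-rate variation is removed entirely, so it no longer disturbs the threshold comparison.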
In particular, the threshold test is used as a switch to select between the single-field-frame de-interlaced picture and the three-field-frame temporal filter combination of single-field-frame de-interlaced pictures. This selection then results in an image where the pixels are from the three-field-frame de-interlacer in those areas where that image differs in small amounts (i.e., below the threshold) from the single field-frame image, and where the pixels are from the single field-frame image in those areas where the three-field-frame differed more than the threshold amount from the single-field-frame de-interlaced pixels (after smoothing).
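The switching behavior can be sketched per pixel as follows. Note that the manner in which Rdiff, Gdiff, and Bdiff are combined into a single ThresholdingValue is not reproduced in this text; the sum of absolute channel differences used here is purely an assumption for illustration, as are the function names. The comparison could equally be made on smoothed copies of the two pictures, with the selected output pixel taken from the unsmoothed pictures.

```python
def thresholding_value(single_rgb, three_rgb):
    """Combine per-channel differences into one value.

    ASSUMPTION: sum of absolute R, G, B differences; the actual
    combination is not given in the surrounding text.
    """
    rdiff = single_rgb[0] - three_rgb[0]
    gdiff = single_rgb[1] - three_rgb[1]
    bdiff = single_rgb[2] - three_rgb[2]
    return abs(rdiff) + abs(gdiff) + abs(bdiff)

def select_pixel(single_rgb, three_rgb, threshold=0.2):
    """Pick the single-field-frame pixel when the difference exceeds the threshold."""
    if thresholding_value(single_rgb, three_rgb) > threshold:
        return single_rgb   # fast motion: keep field-rate detail
    return three_rgb        # stable area: keep the temporally filtered pixel
```

With the typical threshold of 0.2, a pixel that differs by 0.4 in every channel is taken from the single field-frame, while a pixel that differs by 0.02 in one channel is taken from the three-field-frame result.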
This technique has proven effective in preserving single-field fast motion details (by switching to the single-field-frame de-interlaced pixels), while smoothing large portions of the image (by switching to the three-field-frame de-interlaced temporal filter combination).
In addition to selecting between the single-field-frame and three-field-frame de-interlaced image, it is also often beneficial to add a bit of the single-field-frame image to the three-field-frame de-interlaced picture, to preserve some of the immediacy of the single field pictures over the entire image. This immediacy is balanced against the temporal smoothness of the three-field-frame filter. A typical blending is to create a new frame by adding 33.33% (⅓) of a single middle field-frame to 66.67% (⅔) of the corresponding three-field-frame smoothed image. This can be done before or after threshold switching, since the result is the same either way, only affecting the smoothed three-field-frame picture. Note that this is effectively equivalent to using a different proportion of the three field-frames, rather than the original three-field-frame weights of [0.25, 0.5, 0.25]. Computing ⅔ of [0.25, 0.5, 0.25] plus ⅓ of [0, 1, 0] yields [0.1667, 0.6666, 0.1667] as the temporal filter for the three field-frames. The more heavily weighted center (current) field-frame brings additional immediacy to the result, even in the smoothed areas which fell below the threshold value. This combination has proven effective in balancing temporal smoothness with immediacy in the de-interlacing process for moving parts of a scene.
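The equivalence between blending in one third of the middle field-frame and using a modified set of three-field-frame weights can be checked with a few lines of arithmetic. The function name and the `single_share` parameter are hypothetical names introduced for this sketch.

```python
def blend_weights(three_field=(0.25, 0.5, 0.25),
                  single=(0.0, 1.0, 0.0),
                  single_share=1.0 / 3.0):
    """Effective temporal weights after mixing a share of the middle field-frame
    into the three-field-frame result."""
    return tuple(
        (1.0 - single_share) * t + single_share * s
        for t, s in zip(three_field, single)
    )

weights = blend_weights()
# weights is approximately (1/6, 2/3, 1/6), and the three weights still sum to one.
```

Because the weights continue to sum to one, the blend changes only the balance between temporal smoothness and immediacy, not the overall brightness.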
Use of Linear Filters
Sums, filters, or matrices involving video pictures should take into account the fact that pixel values in video are non-linear signals. For example, the video curve for HDTV can be several variations of coefficients and factors, but a typical formula is the international CCIR XA-11 (now called Rec. 709):
V=1.0993*L^0.45−0.0993 for L>0.018051
V=4.5*L for L<=0.018051
where V is the video value and L is linear light luminance.
The variations adjust the threshold (0.018051) a little, the factor (4.5) a little (e.g. 4.0), and the exponent (0.45) a little (e.g., 0.4). The fundamental formula, however, remains the same.
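For illustration, the typical formula above and its inverse can be sketched as follows, using the constants given in the text. The function names are assumptions of the sketch, and the inverse is obtained simply by solving each branch of the formula for L.

```python
def linear_to_video(L):
    """Convert linear light L (0..1) to a non-linear video value V."""
    if L > 0.018051:
        return 1.0993 * L ** 0.45 - 0.0993
    return 4.5 * L

def video_to_linear(V):
    """Inverse of linear_to_video, solving each branch for L."""
    knee = 4.5 * 0.018051          # video value at the branch point
    if V > knee:
        return ((V + 0.0993) / 1.0993) ** (1.0 / 0.45)
    return V / 4.5
```

The variations mentioned above would change only the three constants (threshold, linear factor, and exponent), so a practical implementation might take them as parameters.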
A matrix operation, such as a RGB to/from YUV conversion, implies linear values. The fact that MPEG in general uses the video non-linear values as if they were linear results in leakage between the luminance (Y) and the color values (U and V). This leakage interferes with compression efficiency. The use of a logarithmic representation, such as is used with film density units, corrects much of this problem. The various types of MPEG encoding are neutral to the non-linear aspects of the signal, although their efficiency is affected due to the use of the matrix conversion RGB to/from YUV. YUV (U=R−Y, V=B−Y) should have Y computed as a linearized sum of 0.59 G, plus 0.29 R, plus 0.12 B (or slight variations on these coefficients). However, U (=R−Y) becomes equivalent to R/Y in logarithmic space, which is orthogonal to luminance. Thus, a shaded orange ball will not vary the U (=R−Y) parameter in a logarithmic representation. The brightness variation will be represented completely in the Luminance parameter, where full detail is provided.
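The shaded orange ball example can be made concrete with a small numeric sketch. Here the color values are treated as linear light, Y uses the coefficients given above, and the function names and the sample orange color are assumptions chosen for illustration.

```python
import math

def luminance(r, g, b):
    """Y as a weighted sum of linear R, G, B."""
    return 0.29 * r + 0.59 * g + 0.12 * b

def u_difference(r, g, b):
    """U = R - Y computed directly on the values."""
    return r - luminance(r, g, b)

def u_logarithmic(r, g, b):
    """U = log(R) - log(Y), i.e. log(R / Y), in a logarithmic representation."""
    return math.log(r) - math.log(luminance(r, g, b))

orange = (1.0, 0.5, 0.1)                 # hypothetical lit color
shaded = tuple(0.25 * c for c in orange) # same surface at one quarter brightness
# u_difference changes with shading; u_logarithmic does not.
```

Scaling all three channels by the same shading factor scales R−Y by that factor, whereas log(R)−log(Y) is unchanged, so the brightness variation is carried entirely by the luminance term.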
The linear vs. logarithmic vs. video issue impacts filtering. A key point to note is that small signal excursions (e.g. 10% or less) are approximately correct when a non-linear video signal is processed as if it were a linear signal. This is because a piece-wise linear approximation to the smooth video-to-from-linear conversion curve is reasonable. However, for large excursions, a linear filter is much more effective, and produces much better image quality. Accordingly, if large excursions are to be optimally coded, transformed, or otherwise processed, it would be desirable to first convert the non-linear signal to a linear one in order to be able to apply a linear filter.
De-interlacing is therefore much better when each filter and summation step utilizes conversions to linear values prior to filtering or summing. This is due to the large signal excursions inherent in interlaced signals at small details of the image. After filtering, the image signals are converted back to the non-linear video digital representation. Thus, the three-field-frame weighting (e.g., [0.25, 0.5, 0.25] or [0.1667, 0.6666, 0.1667]) should be performed on a linearized video signal. Other filtering and weighted sums of partial terms in noise and de-interlace filtering should also be converted to linear form for computation. Which operations warrant linear processing is determined by signal excursion, and the type of filtering. Image sharpening can be appropriately computed in video or logarithmic non-linear representations, since it is self-proportional. However, matrix processing, spatial filtering, weighted sums, and de-interlace processing should be computed using linearized digital values.
As a simple example, the single field-frame de-interlacer described above computes missing alternate lines by averaging the line above and below each actual line. This average is much more correct numerically and visually if this average is done linearly. Thus, instead of summing 0.5 times the line above plus 0.5 times the line below, the digital values are linearized first, then averaged, and then reconverted back into the non-linear video representation.
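A minimal sketch of this linearize, average, and reconvert sequence for two vertically adjacent pixel values follows. It reuses the typical video formula given earlier; the function names are assumptions of the sketch.

```python
def linear_to_video(L):
    """Linear light to non-linear video value, using the typical formula above."""
    if L > 0.018051:
        return 1.0993 * L ** 0.45 - 0.0993
    return 4.5 * L

def video_to_linear(V):
    """Inverse of linear_to_video."""
    if V > 4.5 * 0.018051:
        return ((V + 0.0993) / 1.0993) ** (1.0 / 0.45)
    return V / 4.5

def average_in_linear_light(v_above, v_below):
    """Synthesize a missing pixel: linearize, average, convert back to video."""
    lin = 0.5 * video_to_linear(v_above) + 0.5 * video_to_linear(v_below)
    return linear_to_video(lin)

naive = 0.5 * 0.0 + 0.5 * 1.0                   # 0.5 when averaging video values
correct = average_in_linear_light(0.0, 1.0)     # roughly 0.7 in video units
```

For a black line above and a white line below, averaging the video values directly gives 0.5, whereas averaging in linear light gives a noticeably brighter value; for small excursions the two results are nearly identical, consistent with the small-signal observation above.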
In noise processing, the most useful filter is the median filter. A three element median filter just ranks the three entries, via a simple sort, and picks the middle one. For example, an X (horizontal) median filter looks at the red value (or green or blue) of three adjacent horizontal pixels, and picks the one with the middle-most value. If two are the same, that value is selected. Similarly, a Y (vertical) filter looks in the scanlines above and below the current pixel, and again picks the middle value.
It has been experimentally determined that it is useful to average the results from applying both an X and a Y median filter to create a new noise-reducing component picture (i.e., each new pixel is the 50% equal average of the X and Y medians for the corresponding pixel from a source image).
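The averaged X and Y median can be sketched for one color record (for example, the red values of an image) as follows. The function names are assumptions, and the treatment of border pixels (repeating the edge pixel) is an assumption of the sketch that the text does not address.

```python
def median3(a, b, c):
    """Middle value of three entries, found by a simple sort."""
    return sorted((a, b, c))[1]

def xy_median_average(img):
    """Return a new image where each pixel is the equal average of its
    horizontal (X) and vertical (Y) three-element medians.

    img: list of rows of numeric values for a single color record.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            xl, xr = max(x - 1, 0), min(x + 1, w - 1)   # clamp at borders
            yu, yd = max(y - 1, 0), min(y + 1, h - 1)
            mx = median3(img[y][xl], img[y][x], img[y][xr])
            my = median3(img[yu][x], img[y][x], img[yd][x])
            out[y][x] = 0.5 * mx + 0.5 * my
    return out
```

An isolated bright pixel in an otherwise flat region is removed completely, since both its X and Y medians equal the surrounding value.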
In addition to X and Y (horizontal and vertical) medians, it is also possible to take diagonal and other medians. However, the vertical and horizontal pixel values are most close physically to any particular pixel, and therefore produce less potential error or distortion than the diagonals. However, such other medians remain available in cases where noise reduction is difficult using only the vertical and horizontal medians.
Another beneficial source of noise reduction is information from the previous and subsequent frame (i.e., a temporal median). As mentioned below, motion analysis provides the best match for moving regions. However, it is compute intensive. If a region of the image is not moving, or is moving slowly, the red values (and green and blue) from a current pixel can be median filtered with the red value at that same pixel location in the previous and subsequent frames. However, odd artifacts may occur if significant motion is present and such a temporal filter is used. Thus, it is preferred that a threshold be taken first, to determine whether such a median would differ more than a selected amount from the value of a current pixel. The threshold can be computed essentially the same as for the de-interlacing threshold above:
Rdiff=R_current_pixel minus R_temporal_median
Gdiff=G_current_pixel minus G_temporal_median
Bdiff=B_current_pixel minus B_temporal_median
The ThresholdingValue is then compared to a threshold setting. Typical threshold settings are in the range 0.1 to 0.3, with 0.2 being typical. Above the threshold, the current value is kept. Below the threshold, the temporal median is used. The block diagram of
An additional median type is a median taken between the X, Y, and temporal medians. Another median type can take the temporal median, and then take the equal average of the X and Y medians from it.
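The thresholded temporal median can be sketched per pixel as follows. As with the de-interlacing threshold, the way the three channel differences are combined into a ThresholdingValue is not reproduced in this text; the sum of absolute differences used below is an assumption for illustration only, as are the function names.

```python
def median3(a, b, c):
    """Middle value of three entries."""
    return sorted((a, b, c))[1]

def thresholded_temporal_median(prev_rgb, cur_rgb, next_rgb, threshold=0.2):
    """Median-filter a pixel across three frames, unless doing so would
    change the current pixel by more than the threshold."""
    med = tuple(median3(p, c, n) for p, c, n in zip(prev_rgb, cur_rgb, next_rgb))
    # ASSUMPTION: ThresholdingValue is the sum of absolute channel differences.
    value = sum(abs(c - m) for c, m in zip(cur_rgb, med))
    if value > threshold:
        return cur_rgb   # likely motion: keep the current value
    return med           # likely noise: use the temporal median
```

A small deviation in a static region is replaced by the temporal median, while a large change (as from a moving object) leaves the current pixel untouched, avoiding the odd artifacts mentioned above.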
Each type of median can cause problems. X and Y medians smear and blur an image, so that it looks “greasy”. Temporal medians cause smearing of motion over time. Since each median can result in problems, yet each median's properties are different (and, in some sense, “orthogonal”), it has been determined experimentally that the best results come by combining a variety of medians.
50% of the original image (Frame N 40) (thus, the most noise reduction is 3 dB, or half);
15% of the average of X and Y medians 42, 44, respectively;
10% of the thresholded temporal median 46;
10% of the average of X and Y medians of the thresholded temporal median (48); and
15% of a three-way X, Y, and temporal median (50).
This set of time medians does a reasonable job of reducing the noise in the image without making it appear “greasy” or blurred, causing temporal smearing of moving objects, or losing detail. Another useful weighting of these five terms is 35%, 20%, 22.5%, 10%, and 12.5%, respectively.
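The linear weighted sum of the five terms listed above can be sketched as follows for a single pixel value. The five inputs are assumed to have been computed already (the original value, the X/Y median average, the thresholded temporal median, the X/Y median average of the thresholded temporal median, and the three-way median); the constant and function names are hypothetical.

```python
PREFERRED_WEIGHTS = (0.50, 0.15, 0.10, 0.10, 0.15)
ALTERNATE_WEIGHTS = (0.35, 0.20, 0.225, 0.10, 0.125)

def combine_terms(terms, weights=PREFERRED_WEIGHTS):
    """Weighted sum of the five noise-reduction terms for one pixel value.

    terms: (original, xy_median_avg, temporal_median,
            xy_median_avg_of_temporal, three_way_median)
    """
    if len(terms) != len(weights):
        raise ValueError("expected one weight per term")
    return sum(w * t for w, t in zip(weights, terms))
```

Both weightings sum to one, so flat regions keep their level. With the first weighting, half of the original pixel always survives, which is why the noise reduction from this stage cannot exceed a factor of two.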
In addition, it is useful to apply motion-compensation by applying center weighted temporal filters to a motion-compensated n×n region, as described below. This can be added to the median filtered image result (of five terms, just described) to further smooth the image, providing better smoothing and detail on moving image regions.
In addition to “in-place” temporal filtering, which does a good job at smoothing slow-moving details, de-interlacing and noise reduction can also be improved by use of motion analysis. Adding the pixels at the same location in three fields or three frames is valid for stationary objects. However, for moving objects, if temporal averaging/smoothing is desired, it is often more optimal to attempt to analyze prevailing motion over a small group of pixels. For example, an n×n block of pixels (e.g., 2×2, 3×3, 4×4, 6×6, or 8×8) can be used to search in previous and subsequent fields or frames to attempt to find a match (in the same way MPEG-2 motion vectors are found by matching 16×16 macroblocks). Once a best match is found in one or more previous and subsequent frames, a “trajectory” and “moving mini-picture” can be determined. For interlaced fields, it is best to analyze comparisons as well as compute inferred moving mini-pictures utilizing the results of the thresholded de-interlaced process above. Since this process has already separated the fast-moving from the slow-moving details, and has already smoothed the slow moving details, the picture comparisons and reconstructions are more applicable than individual de-interlaced fields.
The motion analysis preferably is performed by comparison of an n×n block in the current thresholded de-interlaced image with all nearby blocks in the previous and subsequent one or more frames. The comparison may be the absolute value of differences in luminance or RGB over the n×n block. One frame is sufficient forward and backward if the motion vectors are nearly equal and opposite. However, if the motion vectors are not nearly equal and opposite, then an additional one or two frames forward and backward can help determine the actual trajectory. Further, different de-interlacing treatments may be useful in helping determine the “best guess” motion vectors going forward and back. One de-interlacing treatment can be to use only individual de-interlaced fields, although this is heavily prone to aliasing and artifacts on small moving details. Another de-interlacing technique is to use only the three-field-frame smooth de-interlacing, without thresholding, having weightings [0.25, 0.5, 0.25], as described above. Although details are smoothed and sometimes lost, the trajectory may often be more correct.
Once a trajectory is found, a “smoothed n×n block” can be created by temporally filtering using the motion-vector-offset pixels from the one (or more) previous and subsequent frames. A typical filter might again be [0.25, 0.5, 0.25] or [0.1667, 0.6666, 0.1667] for three frames, and possibly [0.1, 0.2, 0.4, 0.2, 0.1] for two frames back and forward. Other filters, with less central weight, are also useful, especially with smaller block sizes (such as 2×2, 3×3, and 4×4). Reliability of the match between frames is indicated by the absolute difference value. Large minimum absolute differences can be used to select more center weight in the filter. Lower values of absolute differences can suggest a good match, and can be used to select less center weight to more evenly distribute the average over a span of several frames of motion-compensated blocks.
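The block search and the motion-compensated temporal filter can be sketched together as follows, for a single-channel (e.g., luminance) image held as a list of rows. This is a simplified illustration under several assumptions: an exhaustive search over a small window, the sum of absolute differences as the match measure, one frame back and one frame forward, and fixed [0.25, 0.5, 0.25] weights rather than weights selected from the match reliability. All function and parameter names are hypothetical.

```python
def block_sad(cur, ref, x, y, dx, dy, n):
    """Sum of absolute differences between the n x n block at (x, y) in cur
    and the block displaced by (dx, dy) in ref."""
    total = 0.0
    for j in range(n):
        for i in range(n):
            total += abs(cur[y + j][x + i] - ref[y + j + dy][x + i + dx])
    return total

def best_match(cur, ref, x, y, n=2, search=2):
    """Return (sad, dx, dy) of the best-matching displaced block within +/- search."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that would fall outside the reference frame.
            if x + dx < 0 or y + dy < 0 or x + dx + n > w or y + dy + n > h:
                continue
            score = block_sad(cur, ref, x, y, dx, dy, n)
            if best is None or score < best[0]:
                best = (score, dx, dy)
    return best

def smoothed_block(prev, cur, nxt, x, y, n=2, search=2,
                   weights=(0.25, 0.5, 0.25)):
    """Temporally filter an n x n block along its motion trajectory."""
    _, pdx, pdy = best_match(cur, prev, x, y, n, search)
    _, ndx, ndy = best_match(cur, nxt, x, y, n, search)
    wp, wc, wn = weights
    return [
        [wp * prev[y + j + pdy][x + i + pdx]
         + wc * cur[y + j][x + i]
         + wn * nxt[y + j + ndy][x + i + ndx]
         for i in range(n)]
        for j in range(n)
    ]
```

The minimum difference value returned by `best_match` could be used, as described above, to choose more center weight when the match is poor and less center weight when the match is good.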
These filter weights can be applied to: individual de-interlaced motion-compensated field-frames; thresholded three-field-frame de-interlaced pictures, described above; and non-thresholded three-field-frame de-interlaced images, with a [0.25, 0.5, 0.25] weighting, also as described above. However, the best filter weights usually come from applying the motion-compensated block linear filtering to the thresholded three-field-frame result described above. This is because the thresholded three-field-frame image is both the smoothest (in terms of removing aliasing in smooth areas), as well as the most motion-responsive (in terms of defaulting to a single de-interlaced field-frame above the threshold). Thus, the motion vectors from motion analysis can be used as the inputs to multi-frame or multi-de-interlaced-field-frame or single-de-interlaced field-frame filters, or combinations thereof. The thresholded multi-field-frame de-interlaced images, however, form the best filter input in most cases.
Motion analysis is computationally expensive when the search region is large, as it must be when fast motion (such as ±32 pixels) might be found. Accordingly, it may be best to accelerate the search by using special-purpose hardware or a digital signal processor assisted computer.
Once motion vectors are found, together with their absolute difference measure of accuracy, they can be utilized for the complex process of attempting frame rate conversion. However, occlusion issues (objects obscuring or revealing others) will confound matches, and cannot be accurately inferred automatically. Occlusion can also involve temporal aliasing, as can normal image temporal undersampling and its beat with natural image frequencies (such as the “backward wagon wheel” effect in movies). These problems often cannot be unraveled by any known computation technique, and to date require human assistance. Thus, human scrutiny and adjustment, when real-time automatic processing is not required, can be used for off-line and non-real-time frame-rate conversion and other similar temporal processes.
De-interlacing is a simple form of the same problem. Just as with frame-rate-conversion, the task of de-interlacing is theoretically impossible to perform perfectly. This is especially due to the temporal undersampling (closed shutter), and an inappropriate temporal sample filter (i.e., a box filter). However, even with correct samples, issues such as occlusion and interlace aliasing further ensure the theoretical impossibility of correct results. The cases where this is visible are mitigated by the depth of the tools, as described here, which are applied to the problem. Pathological cases will always exist in real image sequences. The goal can only be to reduce the frequency and level of impairment when these sequences are encountered. However, in many cases, the de-interlacing process can be acceptably fully automated, and can run unassisted in real-time. Even so, there are many parameters which can often benefit from manual adjustment.
Filter Smoothing of High Frequencies
In addition to median filtering, reducing high frequency detail will also reduce high frequency noise. However, this smoothing comes at the price of loss of sharpness and detail. Thus, only a small amount of such smoothing is generally useful. A filter which creates smoothing can be easily made, as with the threshold for de-interlacing, by down-filtering with a normal filter (e.g., truncated sinc filter) and then up-filtering with a gaussian filter. The result will be smoothed because it is devoid of high frequency picture detail. When such a term is added, it typically must be in very small amounts, such as 5% to 10%, in order to provide a small amount of noise reduction. In larger amounts, the blurring effect generally becomes quite visible.
Base Layer Noise Filtering
The filter parameters for the median filtering described above for an original image should be matched to the noise characteristics of the film grain or image sensor that captured the image. After this median filtered image is down-filtered to generate an input to the base layer compression process, it still contains a small amount of noise. This noise may be further reduced by a combination of another X-Y median filter (equally averaging the X and Y medians), plus a very small amount of the high frequency smoothing filter. A preferred filter weighting of these three terms, applied to each pixel of the base layer, is:
70% of the original base layer (down filtered from median-filtered original above);
22.5% of the average of X and Y medians; and
7.5% of the down-up smoothing filter.
This small amount of additional filtering in the base layer provides a small additional amount of noise reduction and improved stability, resulting in better MPEG encoding and limiting the amount of noise added by such encoding.
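The three-term blend can be sketched per pixel as shown below. Several details are assumptions made for illustration: the medians are taken over three horizontal and three vertical neighbors, border pixels reuse their edge neighbors, the down-up smoothed image is supplied by the caller, and the weight on the unfiltered base pixel is computed as whatever remains after the median and smoothing terms, on the assumption that the three weights are meant to sum to 100%.

```python
def median3(a, b, c):
    """Median of three values."""
    return sorted((a, b, c))[1]


def base_layer_noise_filter(base, smoothed, w_median=0.225, w_smooth=0.075):
    """Blend each base-layer pixel with its X-Y median and a smoothed copy.

    `base` and `smoothed` are equal-sized lists of rows. The weight on the
    unfiltered pixel is the remainder, so the three weights sum to 1.
    """
    h, w = len(base), len(base[0])
    w_base = 1.0 - w_median - w_smooth
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            left, right = base[y][max(x - 1, 0)], base[y][min(x + 1, w - 1)]
            up, down = base[max(y - 1, 0)][x], base[min(y + 1, h - 1)][x]
            med_x = median3(left, base[y][x], right)   # horizontal median
            med_y = median3(up, base[y][x], down)      # vertical median
            xy_median = 0.5 * (med_x + med_y)          # equal average of the two
            row.append(w_base * base[y][x]
                       + w_median * xy_median
                       + w_smooth * smoothed[y][x])
        out.append(row)
    return out
```

With these weights an isolated spike is pulled only part of the way toward its neighbors, which matches the intent of a small additional amount of noise reduction.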
Downsizing and Upsizing Filters
Experimentation has shown that the downsizing filter used in creating a base layer from a high resolution original picture works best if it includes modest negative lobes and an extent which stops after the first very small positive lobes after the negative lobes. FIG. 5 is a diagram of the relative shape, amplitudes, and lobe polarity of a preferred downsizing filter. The down filter essentially is a center-weighted function which has been truncated to a center positive lobe 500, a symmetric pair of adjacent (bracketing) small negative lobes 502, and a symmetric pair of adjacent (bracketing) very small outer positive lobes 504. The absolute amplitude of the lobes 500, 502, 504 may be adjusted as desired, so long as the relative polarity and amplitude inequality relationships shown in FIG. 5 are maintained. However, a good first approximation for the relative amplitudes is defined by a truncated sinc function (sinc(x)=sin(x)/x). Such filters can be used separably, which means that the horizontal data dimension is independently filtered and resized, and then the vertical data dimension, or vice versa; the result is the same.
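The separability property can be checked with a short sketch. The kernel values used in the usage note are hypothetical, chosen only to show the lobe polarity described above (positive center, small negative, very small outer positive), and the edge handling by index clamping is likewise an assumption.

```python
def filter_and_halve(samples, kernel):
    """Filter a 1-D list with an odd-length kernel and keep every second
    output sample. Indices past either end are clamped to the edge."""
    half = len(kernel) // 2
    last = len(samples) - 1
    out = []
    for center in range(0, len(samples), 2):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = min(max(center + k - half, 0), last)
            acc += weight * samples[idx]
        out.append(acc)
    return out


def transpose(image):
    """Swap rows and columns of a list-of-rows image."""
    return [list(col) for col in zip(*image)]


def downsize_rows_then_columns(image, kernel):
    rows = [filter_and_halve(row, kernel) for row in image]
    return transpose([filter_and_halve(col, kernel) for col in transpose(rows)])


def downsize_columns_then_rows(image, kernel):
    cols = transpose([filter_and_halve(col, kernel) for col in transpose(image)])
    return [filter_and_halve(row, kernel) for row in cols]
```

For example, with the illustrative kernel `[0.02, 0.0, -0.1, 0.0, 0.33, 0.5, 0.33, 0.0, -0.1, 0.0, 0.02]`, both orderings produce the same half-size image up to floating-point rounding.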
When creating a base layer original (as input to the base layer compression) from a low-noise high resolution original input, the preferred downsizing filter has first negative lobes which are of a normal sinc function amplitude. For clean and for high resolution input images, this normal truncated sinc function works well. For lower resolutions (e.g., 1280×720, 1024×768, or 1536×768), and for noisier input pictures, a reduced first negative lobe amplitude in the filters works better. A suitable amplitude in such cases is about half the truncated sinc function negative lobe amplitude. The small first positive lobes outside of the first negative lobes are also reduced to lower amplitude, typically to ½ to ⅔ of the normal sinc function amplitude. The effect of reducing the first negative lobes is the main issue, since the small outside positive lobes do not contribute to picture noise. Further samples outside the first positive lobes preferably are truncated to minimize ringing and other potential artifacts.
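One way to generate such taps is sketched below, assuming an integer downsizing factor, a sinc whose zero crossings fall at multiples of that factor (in input samples), and normalization to unity gain. The function name and the `neg_scale` and `pos_scale` parameters are hypothetical; values of 1.0 give the normal truncated sinc, while about 0.5 for the negative lobes and ½ to ⅔ for the outer positive lobes correspond to the milder filter described above.

```python
import math


def down_kernel(factor=2, neg_scale=1.0, pos_scale=1.0):
    """Truncated-sinc taps for downsizing by an integer `factor`.

    Taps span the center lobe, the first negative lobes, and the first
    small positive lobes; everything beyond is truncated.
    """
    taps = []
    for k in range(-3 * factor + 1, 3 * factor):
        x = math.pi * k / factor          # zero crossings at multiples of factor
        value = 1.0 if k == 0 else math.sin(x) / x
        lobe = abs(k) // factor           # 0 center, 1 negative, 2 outer positive
        if lobe == 1:
            value *= neg_scale
        elif lobe == 2:
            value *= pos_scale
        taps.append(value)
    total = sum(taps)
    return [t / total for t in taps]      # normalize so the taps sum to 1
```

Calling `down_kernel(neg_scale=0.5, pos_scale=0.5)` halves both outer lobe pairs relative to the center tap, which is one point in the range suggested for lower resolution or noisier inputs.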
The choice of whether to use milder negative lobes or full sinc function amplitude negative lobes in the downfilter is determined by the resolution and noise level of the original image. It is also somewhat a function of image content, since some types of scenes are easier to code than others (mainly related to the amount of motion and change in a particular shot). By using a “milder” downfilter having reduced negative lobes, noise in the base layer is reduced, and a cleaner and quieter compression of the base layer is achieved, thus also resulting in fewer artifacts.
Experimentation has also shown that the optimal upsizing filter has a center positive lobe with small adjacent negative lobes, but no further positive lobes. FIGS. 6A and 6B are diagrams of the relative shape, amplitudes, and lobe polarity of a pair of preferred upsizing filters for upsizing by a factor of 2. A central positive lobe 600, 600′ is bracketed by a pair of small negative lobes 602, 602′. An asymmetrically placed positive lobe 604, 604′ is also required. These paired upfilters could also be considered to be truncated sinc filters centered on the newly created samples. For example, for a factor of two upfilter, two new samples will be created for each original sample. The small adjacent negative lobes 602, 602′ have less negative amplitude than is used in the corresponding downsizing filter (FIG. 5), or than would be used in an optimal (sinc-based) upsizing filter for normal images. This is because the images being upsized are decompressed, and the compression process changes the spectral distribution. Thus, more modest negative lobes, and no additional positive lobes beyond the middle ones 600, 600′, work better for upsizing a decompressed base layer.
Experimentation has shown that slight negative lobes 602, 602′ provide a better layered result than positive-only gaussian or spline upfilters (note that splines can have negative lobes, but are most often used in the positive-only form). Thus, this upsizing filter preferably is used for the base layer in both the encoder and the decoder.
Weighting of High Octave of Picture Detail
In the preferred embodiment, the signal path which expands the original uncompressed base layer input image uses a gaussian upfilter rather than the upfilter described above. In particular, a gaussian upfilter is used for the “high octave” of picture detail, which is determined by subtracting the expanded original base-resolution input image (without using compression) from the original picture. Thus, no negative lobes are used for this particular upfiltered expansion.
As noted above, for MPEG-2 this high octave difference signal path is typically weighted with 0.25 (or 25%) and added to the expanded decompressed base layer (using the other upfilter described above) as input to the enhancement layer compression process. However, experimentation has shown that weights of 10%, 15%, 20%, 30%, and 35% are useful for particular images when using MPEG-2. Other weights may also prove useful. For MPEG-4, it has been found that filter weights of 4-8% may be optimal when used in conjunction with other improvements described below. Accordingly, this weighting should be regarded as an adjustable parameter, depending upon the encoding system, the scenes being encoded/compressed, the particular camera (or film) being used, and the image resolution.
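The weighted high octave term can be illustrated in one dimension as follows. The two resizing functions are deliberately simple stand-ins, not the preferred filters: pair averaging replaces the downsizing filter, and a positive-only [0.25, 0.75] interpolator stands in for the gaussian upfilter. Only the structure (original minus expanded base-resolution original, scaled by an adjustable weight) follows the text; the input is assumed to have even length.

```python
def downsize_by_2(samples):
    """Stand-in down filter: average each pair of samples."""
    return [(samples[i] + samples[i + 1]) / 2.0
            for i in range(0, len(samples) - 1, 2)]


def positive_only_upsize_by_2(samples):
    """Stand-in for a gaussian upfilter: positive-only taps, no negative lobes."""
    out = []
    last = len(samples) - 1
    for i, s in enumerate(samples):
        left = samples[max(i - 1, 0)]
        right = samples[min(i + 1, last)]
        out.append(0.75 * s + 0.25 * left)
        out.append(0.75 * s + 0.25 * right)
    return out


def weighted_high_octave(original, weight=0.25):
    """Return weight * (original - expanded base-resolution original)."""
    expanded = positive_only_upsize_by_2(downsize_by_2(original))
    return [weight * (o - e) for o, e in zip(original, expanded)]
```

Treating `weight` as a parameter reflects the point that the appropriate value depends on the encoding system and the material, for example 0.25 for MPEG-2 or roughly 0.04 to 0.08 for MPEG-4.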
Filters with Negative Lobes For Motion Compensation in MPEG-2 and MPEG-4
In MPEG-4, reference filters have been implemented for shifting macroblocks when finding the best motion vector match, and then using the matched region for motion compensation. MPEG-4 video coding, like MPEG-2, supports ½ pixel resolution of motion vectors for macroblocks. Unlike MPEG-2, MPEG-4 also supports ¼ pixel accuracy. However, in the reference implementation of MPEG-4, the filters used are sub-optimal. In MPEG-2, the half-way point between pixels is just the average of the two neighbors, which is a sub-optimal box filter. In MPEG-4, this filter is used for ½ pixel resolution. If ¼ pixel resolution is invoked in MPEG-4 Part 2, a filter with negative lobes is used for the half-way point, but a sub-optimal box filter with this result and the neighboring pixels is used for the ¼ and ¾ points.
Further, the chrominance channels (U=R−Y and V=B−Y) do not use any sub-pixel resolution in the motion compensation step under MPEG-4. Since the luminance channel (Y) has resolution to the ½ or ¼ pixel, the half-resolution chrominance U and V channels should be sampled using filters to ¼ pixel resolution, corresponding to ½ pixel in luminance. When ¼ pixel resolution is selected for luminance, then ⅛ pixel resolution should be used for U and V chrominance.
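The relationship between luminance and chrominance vector precision can be made concrete with a small sketch. It assumes chrominance is stored at half the luminance resolution along the axis in question and that motion components are integers in units of 1/2 or 1/4 luminance pixel; the function name and return layout are hypothetical.

```python
def chroma_vector(luma_mv, luma_denominator):
    """Map one luminance motion component onto half-resolution chrominance.

    `luma_mv` is in units of 1/luma_denominator luminance pixels (2 for
    half-pixel, 4 for quarter-pixel). A displacement of d luminance pixels
    is d/2 chrominance pixels, so the same integer expressed over twice the
    denominator keeps full precision.

    Returns (whole_pixels, fraction_numerator, chroma_denominator), with
    the whole part rounded toward negative infinity.
    """
    chroma_denominator = 2 * luma_denominator
    whole, fraction = divmod(luma_mv, chroma_denominator)
    return whole, fraction, chroma_denominator
```

For instance, a quarter-pixel luminance component of 5 (1.25 luminance pixels) maps to 5/8 of a chrominance pixel, which is why ⅛ pixel chrominance filters are needed in that mode.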
Experiments have shown that the effects of filtering are significantly improved by using a negative lobe truncated sinc function (as described above) for filtering the ¼, ½, and ¾ pixel points when doing ¼ pixel resolution in luminance, and by using similar negative lobes when doing ½ pixel resolution for the filter which creates the ½ pixel position.
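A minimal sketch of such a displacement filter is given below. It assumes a four-tap window, so that the sinc is cut off after its first negative lobes, with zero crossings at whole-pixel offsets and taps normalized to sum to 1. These choices, and the function names, are illustrative assumptions rather than the filters used in any MPEG reference implementation.

```python
import math


def subpixel_taps(fraction, half_width=2):
    """Truncated-sinc taps for sampling `fraction` (0..1) past pixel 0.

    Uses integer pixels -half_width+1 .. half_width and normalizes the
    taps so that they sum to 1.
    """
    taps = []
    for k in range(-half_width + 1, half_width + 1):
        x = math.pi * (k - fraction)
        taps.append(1.0 if x == 0.0 else math.sin(x) / x)
    total = sum(taps)
    return [t / total for t in taps]


def interpolate(samples, index, fraction, half_width=2):
    """Sample a 1-D list at position index + fraction, clamping at the edges."""
    taps = subpixel_taps(fraction, half_width)
    last = len(samples) - 1
    acc = 0.0
    for tap, k in zip(taps, range(-half_width + 1, half_width + 1)):
        acc += tap * samples[min(max(index + k, 0), last)]
    return acc
```

At the half-pixel position these assumptions yield taps of [-0.25, 0.75, 0.75, -0.25], in contrast to the [0.5, 0.5] box average; the ¼ and ¾ positions likewise receive negative outer taps instead of a further average.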
Similarly, effects of filtering are significantly improved by using a negative lobe truncated sinc function for filtering the ⅛-pixel points for U and V chrominance when using ¼ pixel luminance resolution, and by using ¼ pixel resolution filters with similar negative lobe filters when using ½ pixel luminance resolution.
It has been discovered that the combination of quarter-pixel motion vectors with truncated sinc motion compensated displacement filtering results in a major improvement in picture quality. In particular, clarity is improved, noise and artifacts are reduced, and chroma detail is increased.
These filters may be applied to video images under MPEG-1, MPEG-2, MPEG-4 or any other appropriate motion-compensated block-based image coding system.
The invention may be implemented in hardware or software, or a combination of both. However, preferably, the invention is implemented in computer programs executing on one or more programmable computers each comprising at least a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), an input device, and an output device. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion.
Each such program may be implemented in any desired computer language (including machine, assembly, or high level procedural, logical, or object oriented programming languages) to communicate with a computer system. In any case, the language may be a compiled or interpreted language.
Each such computer program is preferably stored on a storage medium or device (e.g., ROM, CD-ROM, or magnetic or optical media) readable by a general or special purpose programmable computer system, for configuring and operating the computer when the storage medium or device is read by the computer system to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.
A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, while the preferred embodiment uses MPEG-2 or MPEG-4 coding and decoding, the invention will work with any comparable standard that provides equivalents of I, P, and/or B frames and layers. Accordingly, it is to be understood that the invention is not to be limited by the specific illustrated embodiment, but only by the scope of the appended claims.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US4903317||Jun 23, 1987||Feb 20, 1990||Kabushiki Kaisha Toshiba||Image processing apparatus|
|US5253058||Apr 1, 1992||Oct 12, 1993||Bell Communications Research, Inc.||Efficient coding scheme for multilevel video transmission|
|US5270813||Jul 2, 1992||Dec 14, 1993||At&T Bell Laboratories||Spatially scalable video coding facilitating the derivation of variable-resolution images|
|US5387940||Jul 7, 1993||Feb 7, 1995||Rca Thomson Licensing Corporation||Method and apparatus for providing scaleable compressed video signal|
|US5408270||Jun 24, 1993||Apr 18, 1995||Massachusetts Institute Of Technology||Advanced television system|
|US5414469||Oct 31, 1991||May 9, 1995||International Business Machines Corporation||Motion video compression system with multiresolution features|
|US5418571||Jan 31, 1992||May 23, 1995||British Telecommunicatons Public Limited Company||Decoding of double layer video signals with interpolation replacement on missing data from enhancement layer|
|US5465119||Feb 22, 1991||Nov 7, 1995||Demografx||Pixel interlacing apparatus and method|
|US5493338||Sep 12, 1994||Feb 20, 1996||Goldstar Co., Ltd.||Scan converter of television receiver and scan converting method thereof|
|US5519453||Aug 3, 1994||May 21, 1996||U. S. Philips Corporation||Method of eliminating interfernce signals from video signals|
|US5737027||Dec 27, 1994||Apr 7, 1998||Demografx||Pixel interlacing apparatus and method|
|US5742343||Aug 19, 1996||Apr 21, 1998||Lucent Technologies Inc.||Scalable encoding and decoding of high-resolution progressive video|
|US5828788||Dec 14, 1995||Oct 27, 1998||Thomson Multimedia, S.A.||System for processing data in variable segments and with variable data resolution|
|US5852565||Jan 30, 1996||Dec 22, 1998||Demografx||Temporal and resolution layering in advanced television|
|US5974159||Mar 28, 1997||Oct 26, 1999||Sarnoff Corporation||Method and apparatus for assessing the visibility of differences between two image sequences|
|US5988863||Dec 21, 1998||Nov 23, 1999||Demografx||Temporal and resolution layering in advanced television|
|US6028634||Jul 8, 1998||Feb 22, 2000||Kabushiki Kaisha Toshiba||Video encoding and decoding apparatus|
|US6111975||Jan 11, 1996||Aug 29, 2000||Sacks; Jack M.||Minimum difference processor|
|US6175592||Mar 12, 1997||Jan 16, 2001||Matsushita Electric Industrial Co., Ltd.||Frequency domain filtering for down conversion of a DCT encoded picture|
|US6252906||Jul 31, 1998||Jun 26, 2001||Thomson Licensing S.A.||Decimation of a high definition video signal|
|US6442203||Nov 5, 1999||Aug 27, 2002||Demografx||System and method for motion compensation and frame rate conversion|
|US6489956 *||Oct 6, 1999||Dec 3, 2002||Sun Microsystems, Inc.||Graphics system having a super-sampled sample buffer with generation of output pixels using selective adjustment of filtering for implementation of display effects|
|US6728317||Apr 7, 2000||Apr 27, 2004||Dolby Laboratories Licensing Corporation||Moving image compression quality enhancement using displacement filters with negative lobes|
|US7106322 *||Dec 29, 2000||Sep 12, 2006||Sun Microsystems, Inc.||Dynamically adjusting a sample-to-pixel filter to compensate for the effects of negative lobes|
|US20020003838||Aug 9, 1999||Jan 10, 2002||Toshiya Takahashi||Image conversion apparatus|
|US20040196901||Mar 30, 2004||Oct 7, 2004||Demos Gary A.||Median filter combinations for video noise reduction|
|CA2127151A1||Jun 30, 1994||Mar 22, 1995||Atul Puri||Spatially scalable video encoding and decoding|
|EP0634871A2||Jul 6, 1994||Jan 18, 1995||AT&T Corp.||Scalable encoding and decoding of high-resolution progressive video|
|JPH01140883A||Title not available|
|JPH06165150A||Title not available|
|JPH06350995A||Title not available|
|JPH07203426A||Title not available|
|WO1997028507A1||Jan 24, 1997||Aug 7, 1997||Demografx||Temporal and resolution layering in advanced television|
|WO2001077871A1||Apr 6, 2001||Oct 18, 2001||Demografx||Enhanced temporal and resolution layering in advanced television|
|1||"IEEE Standard Specification for the Implementations of 8×8 Inverse Discrete Cosine Transforms," IEEE Std 1180-1990, The Institute of Electrical and Electronics Engineers, Inc.; United States of America, 13 pages (1991).|
|2||Certified English Translation for Japanese Patent Publication No. 01-140883, published Jun. 2, 1989, entitled "Data Coding Method".|
|3||English language abstract for JP 06165150, published Jun. 10, 1994, entitled: "Dynamic Picture Coding/Decoding Device".|
|4||European Office Action, European Patent Application No. 01924762.6, dated Oct. 11, 2006, 11 pages.|
|5||Girod, "Motion-Compensating Prediction with Fractional Pel Accuracy", IEEE Transactions on Communications, vol. 41, No. 4, Apr. 1993, pp. 604-612.|
|6||H.261, ITU-T Telecommunication Standardization Sector of ITU, Line Transmission of non-telephone signals. Video Codec for Audiovisual Services at p X64 kbits, (Mar. 1993), 32 pages.|
|7||H.263 Appendix III, ITU-T Telecommunication Standardization Sector of ITU, Series H: Audiovisual and Multimedia Systems, Infrastructure of audiovisual services-coding of moving video. Video coding for low bit rate communication, Appendix III: Examples for H.263 encoder/decoder implementations, (Jun. 2001), 48 pages.|
|8||H.263 Appendix III, ITU-T Telecommunication Standardization Sector of ITU, Series H: Audiovisual and Multimedia Systems, Infrastructure of audiovisual services—coding of moving video. Video coding for low bit rate communication, Appendix III: Examples for H.263 encoder/decoder implementations, (Jun. 2001), 48 pages.|
|9||H.263, ITU-T Telecommunication Standardization Sector of ITU, Series H: Audiovisual and Multimedia Systems, Infrastructure of audiovisual services-coding of moving video. Video coding for low bit rate communication, (Jan. 2005), 226 pages.|
|10||H.263, ITU-T Telecommunication Standardization Sector of ITU, Series H: Audiovisual and Multimedia Systems, Infrastructure of audiovisual services—coding of moving video. Video coding for low bit rate communication, (Jan. 2005), 226 pages.|
|11||ISO/IEC 14496-2 International Standard, Information technology-coding of audio-visual objects-Part 2: visual, 2nd Edition, Amendment 2: Streaming video profile, Feb. 1, 2002, 64 pages.|
|12||ISO/IEC 14496-2 International Standard, Information technology—coding of audio-visual objects—Part 2: visual, 2nd Edition, Amendment 2: Streaming video profile, Feb. 1, 2002, 64 pages.|
|13||ISO/IEC 14496-2 International Standard, Information technology-coding of audio-visual objects-Part 2: visual, 2nd Edition, Dec. 1, 2001, 536 pages.|
|14||ISO/IEC 14496-2 International Standard, Information technology—coding of audio-visual objects—Part 2: visual, 2nd Edition, Dec. 1, 2001, 536 pages.|
|15||ISO/IEC JTC 1, "Coding of audio-visual objects-Part 2: Visual," ISO/IEC 14496-2 (MPEG-4 Part 2), Dec. 1999, 348 pages.|
|16||ISO/IEC JTC 1, "Coding of audio-visual objects—Part 2: Visual," ISO/IEC 14496-2 (MPEG-4 Part 2), Dec. 1999, 348 pages.|
|17||Japanese Office Action, Application Serial No. 2001-574651, dated Mar. 22, 2007, 74 pages.|
|18||Machine Translation for Japanese Patent Publication No. 07-2033426, published Aug. 4, 1995, entitled "Hierarchical Coding and Decoding Device" (22 pages).|
|19||Machine Translation for Japanese Patent Publication No. 07-203426, published Aug. 4, 1995, entitled "Hierarchical Coding and Decoding Device" (22 pages).|
|20||Office Action, U.S. Appl. No. 11/187,176, dated Aug. 30, 2007, 16 pages.|
|21||Office Action, U.S. Appl. No. 11/187,176, dated Jan. 4, 2007, 45 pages.|
|22||Patent Abstract of Japan for Japanese Patent Publication No. 01-140883, published Jun. 2, 1989, entitled "Data Coding Method".|
|23||Patent Abstract of Japan for Japanese Patent Publication No. 07-203426, published Aug. 4, 1995, entitled "Hierarchical Coding and Decoding Device".|
|24||Patent Abstracts of Japan, vol. 1995, No. 03 (Apr. 28, 2005) for Japanese Patent Publication JP 06350995, published Dec. 22, 1994, entitled, "Moving Picture Processing Method".|
|25||Puri et al., "Temporal Resolution Scalable Video Coding," Image Processing. 1994 International Conference, IEEE, pp. 947-951 (1994).|
|26||Shen et al., "Adaptive motion vector resampling for compressed video down-scaling", International Conference on Image Processing, vol. 1, pp. 771-774, Oct. 26-29, 1997.|
|27||Vincent, A., et al., "Spatial Prediction in Scalable Video Coding," International Broadcasting Convention, IEEE Conference Publication No. 413, RAI International Congress and Exhibition Centre, Amsterdam, The Netherlands, Sep. 14-18, 1995, pp. 244-249.|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US8767829||Apr 9, 2013||Jul 1, 2014||Dolby Laboratories Licensing Corporation||Switch-select single frame reference|
|US8942285||Jun 11, 2013||Jan 27, 2015||Dolby Laboratories Licensing Corporation||Motion compensation filtering in an image system|
|US8995528||Mar 31, 2014||Mar 31, 2015||Dolby Laboratories Licensing Corporation||Switch-select single frame reference|
|US9002133||Feb 27, 2013||Apr 7, 2015||Sharp Laboratories Of America, Inc.||Multi layered image enhancement technique|
|US9053531||Feb 28, 2013||Jun 9, 2015||Sharp Laboratories Of America, Inc.||Multi layered image enhancement technique|
|U.S. Classification||375/240.29, 375/240.17|
|International Classification||H04N5/44, H04N7/01, H04N7/12, G06T9/00, H04N7/50, H04N9/64, H04N5/14, H04N7/167, H04N7/46, H04N7/26, H04N5/21, H04N7/36|
|Cooperative Classification||H04N19/567, H04N19/86, H04N19/36, H04N19/51, H04N19/31, H04N19/587, H04N19/17, H04N19/33, H04N19/34, H04N5/14, H04N19/523, H04N19/61, H04N19/577, H04N19/467, H04N19/80, H04N19/13, H04N19/59, H04N19/55, H04N5/144, H04N5/4401, H04N7/0122, H04N7/0132, H04N7/012, H04N7/0112, H04N5/21, H04N9/64, H04N21/440263, H04N21/440227, H04N21/440281|
|European Classification||H04N7/01G5, H04N7/46S, H04N7/46T, H04N5/44N, H04N7/01P5, H04N7/26M4R, H04N7/50, H04N7/26A4V, H04N7/26E2, H04N7/26A8R, H04N7/01F, H04N7/26M4C, H04N7/50E2, H04N5/14M, H04N7/36C4, H04N7/26E, H04N7/36C2, H04N5/21, H04N7/26E10, H04N19/00C1, H04N7/26F, H04N7/26P4, H04N5/14, H04N7/01G3, H04N7/46E|