Complexity Reduction Techniques for the Compression of High-Definition Video

Abstract: The state-of-the-art video compression standard, the hybrid predictive-transform H.264/AVC codec, has led to substantial performance improvements compared to other existing standards. Performing predictions is a computationally demanding task, and optimizing this stage may result in substantial encoding speedup. In this paper, we propose more efficient approaches to implementing the H.264/AVC prediction stage. The first idea is to use original data rather than reconstructed data to perform the prediction tests before choosing the best mode. The residue, however, is evaluated using previously decoded data in order to avoid drifting. The second approach is to employ a subset of dominant prediction modes instead of testing all modes recommended by the H.264/AVC standard. The subset is updated frame by frame using macroblock sampling. Results for high-definition sequences show that the quality loss is negligible, allowing us not only to parallelize the inter-prediction stage but also to reduce total complexity.


I. INTRODUCTION
HISTORICALLY, processor manufacturers have responded to the demand for more processing power primarily with faster processor speeds. However, higher clock speeds imply higher power consumption and heat. For that reason, manufacturers have been moving their strategy from purely clock-oriented designs to multi-core architectures.
Image and video processing can be considered driving forces behind this pursuit of computational power. Therefore, it is not surprising that the state-of-the-art video compression standard, H.264/AVC [1], is a computation-hungry application. The H.264/AVC coder has been well described in the literature [2], [3].
When encoding high-definition sequences, complexity is an issue and real-time video coding is challenging. As the computational complexity of H.264/AVC is mainly concentrated in the prediction stage, making it more efficient seems to be a key to real-time coding. The present work suggests new strategies in this direction. The first idea is to parallelize the prediction module to allow for real-time coding by exploiting multi-core architectures. Another idea is to suppress the least frequent prediction modes in order to reduce complexity.
This paper is organized as follows. Section II gives an overview of H.264/AVC macroblock prediction and analyzes prediction complexity. A method to reduce the computational complexity is presented in Section III. The experimental results are shown and discussed in Section IV, while the conclusions are presented in Section V.

II. MACROBLOCK PREDICTION IN H.264/AVC
H.264/AVC is a hybrid video codec, i.e., along with a transform module, it has a prediction module, a differential stage and a feedback loop [2]. The prediction stage uses previously reconstructed samples as input to the prediction model. This avoids mismatches between encoder and decoder data, allowing for synchronous decoding. However, H.264/AVC has a rather complex prediction stage composed of a set of prediction models.
In Fig. 1, the H.264/AVC encoder block diagram is shown and the prediction stage is highlighted. Note that the coder is divided into temporal (Inter) and spatial (Intra) models. "Inter" prediction generates a prediction macroblock from one or more previously encoded video frames using block-based motion estimation and compensation. This model is responsible for almost 90% of the complexity of an H.264/AVC baseline encoder [4]. Important advances over earlier video standards include the support for a range of block sizes (16×16 and down, as in Fig. 2) and refined motion vectors (quarter-sample resolution for the luminance component). In "Intra" prediction, a prediction block is formed based on planar extrapolation of previously encoded and reconstructed neighbouring pixels. The prediction is subtracted from the current block prior to encoding. The 4×4- and 8×8-pixel blocks allow for a total of nine optional prediction modes for luminance, while the 16×16-pixel blocks allow for only four modes, as illustrated in Fig. 3. The encoder typically selects, for each block, the prediction mode that minimizes the difference between the predicted block and the block to be encoded. A prediction for the current macroblock is created from image samples that have already been encoded, either in the same frame or in a previously encoded one. This prediction is subtracted from the current macroblock and the residue is compressed and transmitted, along with the information required by the decoder to repeat the prediction process (motion vectors, prediction modes, etc.). The decoder creates a prediction identical to that of the encoder and adds it to the decoded residual block. The encoder bases its prediction on encoded and decoded image samples (rather than on original video frame samples) in order to ensure that the encoder and decoder predictions are identical.
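To make the cost of block-based motion estimation concrete, the following is a minimal Python sketch of an integer-sample full search that uses the sum of absolute differences (SAD) as the matching criterion. It is an illustration only: the function names `sad` and `full_search`, the list-of-lists frame layout and the restriction to integer displacements inside the reference frame are assumptions made here, and the sketch does not reproduce the reference software, which also handles sub-sample refinement, multiple reference frames and rate-distortion costs.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """SAD between the current block at (bx, by) and the reference block displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )


def full_search(cur, ref, bx, by, size, search_range):
    """Return (best_sad, (dx, dy)) over all integer displacements that stay inside ref."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + size <= height
                      and 0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue  # candidate block would leave the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

The number of SAD evaluations grows with the square of the search range, and the search is repeated for every partition and reference frame, which is consistent with motion estimation dominating the encoder run time.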
The H.264/AVC prediction stage is built upon a myriad of tests applied to choose the best prediction mode in a rate-distortion (RD) sense. It is intuitive that encoding complexity can be reduced by simplifying the prediction module, particularly the motion estimation step.
Sub-optimal fast motion estimation techniques were proposed [5], [6] and incorporated in the H.264/AVC reference software. Exploring the variety of macroblock partitions available in H.264/AVC, there are works [7], [8] that apply motion estimation only to the most probable partition. Intra-prediction tests can also be reduced by selecting the most probable best mode according to heuristics [9], [10].
Another approach is to generalize the rate-distortion analysis by adding a complexity optimization variable. This concept is well suited to the emerging field of wireless digital video communications, where energy and delay constraints are stringent [11], [12].
In order to profile the H.264/AVC (High Profile) encoder, we used gprof and JM12.3 to encode a high-definition video sequence (Pedestrian Area), with rate-distortion optimization turned on, a four-frame reference buffer and fast full-search motion estimation [1]. Results are presented in Tables I and II. Table I indicates the complexity contributions of the H.264/AVC Intra-prediction modes. We observe that the prediction complexity for 4×4- and 8×8-pixel blocks is greater than for 16×16 blocks. This is due to the larger number (nine) of prediction modes available for 4×4 and 8×8 blocks.
Complexity estimates presented in [4] are extended in Table II, where different motion search window sizes were applied to H.264/AVC high-definition video sequence encoding. We observe that the encoder spends a large part of its execution time in motion estimation due to the extensive tests required to find the best match.

A. Open-loop to enable parallelism
In HD video coding, usually only a low level of distortion is tolerated, which implies a high degree of similarity between the compressed video and its original version. So, if we employ original data in the prediction tests instead of reconstructed data, it is likely that the best prediction mode chosen using original data will be the same mode chosen when using reconstructed data. The proposed method, described in Fig. 4, consists of applying original data to choose the best prediction mode, but using locally decoded data in the motion compensation process in order to avoid drifting.
In Fig. 4, in contrast to Fig. 1, the prediction tests do not depend on reconstructed/decoded data. They depend only on original data. The search for the best prediction mode, the most time-consuming stage of an H.264/AVC encoder, can be parallelized given that we do not need to wait for previously encoded/decoded data. Thus, the prediction loop is opened (see Fig. 4). All data/modes used to evaluate the residue are available to the decoder. One can potentially reduce encoding time by a factor of n through the parallel work of n frame prediction engines.
Some aspects of encoder implementation on PC-based platforms still have to be addressed. High-definition encoding requires the processing of a huge amount of data, which must be stored in memory and also transferred to and from the processor. Therefore, bus throughput may be a bottleneck for the whole process.
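The separation between mode decision and residue computation described in this subsection can be sketched as follows. This is a simplified illustration, not the implementation evaluated in this paper: it uses only three of the 4×4 intra modes (vertical, horizontal and DC), a plain SAD cost instead of an RD cost, and hypothetical function names. The point it shows is that `choose_mode` reads only original neighbouring pixels, so it can run for many blocks at once, while the residue is still formed from reconstructed neighbours, as the decoder will do.

```python
MODES = ("vertical", "horizontal", "dc")


def predict(mode, top, left):
    """Build a 4x4 prediction from the 4 pixels above and the 4 pixels to the left."""
    if mode == "vertical":
        return [list(top) for _ in range(4)]
    if mode == "horizontal":
        return [[left[y]] * 4 for y in range(4)]
    if mode == "dc":
        dc = (sum(top) + sum(left) + 4) // 8  # rounded mean of the 8 neighbours
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(mode)


def sad(a, b):
    """Sum of absolute differences between two 4x4 blocks."""
    return sum(abs(a[y][x] - b[y][x]) for y in range(4) for x in range(4))


def choose_mode(block, top, left):
    """Pick the mode with the smallest SAD for the given neighbouring pixels."""
    return min(MODES, key=lambda m: sad(block, predict(m, top, left)))


def open_loop_encode(block, orig_top, orig_left, recon_top, recon_left):
    # Mode decision: original neighbours only, no dependency on decoded data.
    mode = choose_mode(block, orig_top, orig_left)
    # Residue: reconstructed neighbours, so encoder and decoder stay in sync.
    pred = predict(mode, recon_top, recon_left)
    residue = [[block[y][x] - pred[y][x] for x in range(4)] for y in range(4)]
    return mode, residue
```

When the reconstructed neighbours are close to the original ones, as is typical at the low distortion levels used for HD material, `choose_mode` returns the same mode for both inputs and the open-loop decision costs nothing in RD terms.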

B. Prediction Mode Bias
The H.264/AVC prediction stage is rather complex due to the many tests of the various prediction modes available to each macroblock. For instance, to encode P-frames in H.264/AVC High Profile, we can use the following set of Inter- and Intra-frame prediction modes:
• P16×16: motion compensated prediction for 16×16-pixel macroblocks;
• P16×8: motion compensated prediction for 16×8-pixel macroblock partitions;
• P8×16: motion compensated prediction for 8×16-pixel macroblock partitions;
• P<=8×8: motion compensated prediction for macroblock partitions whose size is less than or equal to 8×8 pixels;
• I16MB: intra prediction for 16×16-pixel blocks;
• I8MB: intra prediction for 8×8-pixel blocks;
• I4MB: intra prediction for 4×4-pixel blocks;
• SKIP: zero-residue motion compensated prediction for 16×16-pixel macroblocks.
When compressing high-definition 1080p video sequences (1920×1080 pixels per frame in progressive scan), we verify that the prediction modes used to encode the signals are often repetitive. The frequency profile of selected prediction modes for different sequences and resolutions, ranging from QCIF (176×144 pixels) to 1080p, is presented in Figs. 5 through 7.
We can observe that, as we increase the resolution, prediction modes tend to concentrate around larger macroblock partitions, even though the Riverbed sequence does not strictly follow this trend. In general, some computational effort can be saved when encoding high-definition video sequences by avoiding small partitions in motion compensated prediction.

C. Reduced Mode Set Prediction
The previous analysis suggests that the encoder can save time if it avoids testing the less frequent prediction modes. In order to achieve complexity reduction based on the frequency distribution of the best prediction modes, we randomly select macroblocks to preview the frequency distribution of the next frame. Then, we select the dominant modes, i.e., the modes which correspond to 80% of the choices, according to the following algorithm. Let each frame have $N$ macroblocks. For the $n$-th P- or B-frame:
1. Randomly select a set $S$ of $N_S$ macroblocks
Even though the best-mode frequency distribution is not stationary, our tests have shown that this is a good approximation for adjacent frames. Errors in determining the dominant modes will be reflected in a small degradation of the encoder RD performance.
An important issue is the sampling population size, $N_S$, which will be used to predict the dominant modes of the next frame. The smaller $N_S$ is, the larger the savings, but the worse the RD performance.
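The sampling and selection idea of this subsection can be sketched in Python as follows. Only the random sampling step and the 80% figure come from the text; the rest is an assumption made for illustration. In particular, taking the most frequent modes in decreasing order until their cumulative share reaches the threshold is one plausible reading of "dominant modes", and the names `sample_macroblocks` and `dominant_modes` are hypothetical.

```python
import random
from collections import Counter


def sample_macroblocks(num_macroblocks, sample_size, rng=random):
    """Randomly pick the indices of the macroblocks that will be fully tested."""
    return rng.sample(range(num_macroblocks), sample_size)


def dominant_modes(best_modes, coverage_percent=80):
    """Most frequent modes, taken in decreasing order until they cover the threshold.

    best_modes holds the best mode found for each fully tested macroblock.
    """
    counts = Counter(best_modes)
    total = len(best_modes)
    chosen, covered = [], 0
    for mode, count in counts.most_common():
        if covered * 100 >= coverage_percent * total:
            break  # the modes chosen so far already reach the threshold
        chosen.append(mode)
        covered += count
    return chosen
```

Under this reading, the sampled macroblocks are tested with all modes, their best modes are passed to `dominant_modes`, and the remaining macroblocks are tested only with the returned subset.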

A. Open-Loop Prediction
The open-loop prediction method was implemented in JM10.2 and evaluated using the first 50 frames of the standard test sequences Pedestrian Area, Sunflower, Riverbed and Rush Hour. We varied the QP (quantization parameter) in the range 12 ≤ QP ≤ 36. RD plots are presented in Figs. 8 through 11, comparing the methods with and without an open loop. The average PSNR differences between RD curves [13] are indicated. From the plots, we can observe a negligible quality loss when using original data rather than reconstructed data in the H.264/AVC prediction step. This loss is due to occasional mismatches between the best prediction modes. At high rates, the RD curves tend to overlap. This is expected because the finer the quantization, the closer the original and reconstructed blocks.
The technique was also evaluated at CIF (352×288 pixels) resolution (see Figs. 12 through 14). In this case, performance losses are more significant due to prediction mode mismatches.
The technique was also evaluated for an IPBBBPBBP... GOP configuration, where 21 frames of each HD video sequence were encoded. The results are presented in Figs. 15 through 18. In this case, the performance drop appears to be very small and may be acceptable for most applications.

B. Reduced Mode Set Prediction
The statistical analyzer proposed in Sec. III-C was implemented in the JM12.3 H.264/AVC reference software. For each sequence we used 20 frames. Fast full-search motion estimation was used and the results were obtained by varying the [...] Figure 19 presents RD performance curves for the "Pedestrian Area" sequence for different sampling population sizes, $(N_S/N) \times 100\%$. We observe that the performance difference is very small.
A more detailed approach is to plot the average difference between the performance curves for different sampling population sizes, as shown in Figs. 20 and 21. The average PSNR and bitrate differences between RD curves were evaluated [13]. The time savings are shown in Fig. 22, which relates complexity savings to the population size of fully tested macroblocks, using the proposed method for different HD sequences.
Figure 22 suggests that the size of the fast-predicted macroblock population for the set of test sequences has a direct relation to the complexity savings.
There is a small performance loss when predicting only through dominant modes, due to occasional mismatches. The rate loss is kept below 5% if the population size remains above 10%. The computational savings are relatively small because fast full-search motion estimation was already enabled. In that scheme, the encoder carries out 4×4-pixel block motion estimation and proceeds to the motion estimation of larger partitions by grouping the results (SAD/SSD) of previously stored blocks. Thus, for sequences where intra-predicted macroblocks are more frequent, like Riverbed, the computational savings are greater because motion-estimated prediction modes are not included in the dominant set for some frames.
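The grouping of stored 4×4 results mentioned above can be illustrated with the following Python sketch. It assumes a single candidate displacement, for which the current and reference 16×16 macroblocks are already aligned, and it uses hypothetical function names; it is not the reference software code. It shows why testing larger partitions adds little cost once the 4×4 SADs are available: each larger SAD is just a sum of stored values.

```python
def sad_4x4_grid(cur_mb, ref_mb):
    """Return a 4x4 grid with the SAD of each 4x4 block of a 16x16 macroblock."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur_mb[y][x] - ref_mb[y][x])
    return grid


def partition_sads(grid):
    """Sum stored 4x4 SADs into the SADs of larger partitions (same displacement)."""
    def region(r0, c0, rows, cols):
        return sum(grid[r][c]
                   for r in range(r0, r0 + rows)
                   for c in range(c0, c0 + cols))

    return {
        "8x8": [region(r, c, 2, 2) for r in (0, 2) for c in (0, 2)],
        "16x8": [region(0, 0, 2, 4), region(2, 0, 2, 4)],  # top and bottom halves
        "8x16": [region(0, 0, 4, 2), region(0, 2, 4, 2)],  # left and right halves
        "16x16": [region(0, 0, 4, 4)],
    }
```

Because the pixel-level work is shared by all inter partitions, removing some inter modes from the tested set saves only the cheap summation and comparison steps, which is consistent with the modest savings reported for the fast full-search case.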
We also implemented our method for UMHexS motion estimation [5] in the JM13.2 H.264/AVC reference software. We encoded 50 frames of each sequence and varied the population size from 1% to 50%. Results are presented in Figs. 23 through 25. The computational savings are greater than for the fast full-search case. Nevertheless, the rate and quality losses have also increased.
For the "Riverbed" sequence, the complexity reduction profile is shown in Fig. 26. This sequence is very challenging because of the difficulty in getting good matches through motion estimation, which results in a high number of intra-coded macroblocks. This characteristic was properly "tracked" by our algorithm. Complexity reduction profile plots from HD [...] From Figs. 28 to 30 we observe that the computational savings, although inferior to the UMHexS results, remain greater than for the fast full-search case. Nevertheless, the rate and quality losses have also increased. Again, the algorithm was capable of tracking some specific features of the test sequences: "Riverbed" suffered smaller penalties and achieved the best complexity savings because the algorithm could avoid tests of motion compensated modes due to their lower statistical occurrence.
We believe the performance of the proposed technique can still be improved by separating the dominant modes obtained from P-frames from those obtained from B-frames, which should allow for a better match with the next dominant modes. In the current implementation, the motion compensated frames use the statistics of previous frames to decide which are the dominant modes; we suggest that P-frame statistics be used to predict the dominant modes of the next P-frame, while B-frame statistics be used to predict the dominant modes of the next B-frame.

V. CONCLUSIONS
We propose a method to search for the best prediction mode in H.264/AVC for high-definition sequences. Rather than using previously decoded macroblocks, we propose to use the original macroblocks. In other words, we open the prediction loop. While the original image data is used for the prediction mode decision, the residue is formed using locally decoded data. Hence, drifting is avoided. In our tests, the performance loss is negligible in most cases. The main advantage of the method is that it allows for a parallel implementation, since all prediction modes and motion vectors can be tested simultaneously. This may enable real-time H.264/AVC compression of HD material using massively parallel computing systems that do not share memory buses.
Another contribution is a reduced-complexity method to carry out the prediction mode tests in H.264/AVC for high-definition sequences. Instead of testing all available prediction modes, we search only within a "dominant" mode subset. Our tests have shown that the RD performance is barely affected by skipping prediction mode tests, while complexity is reduced. The method does not require a new decoder implementation because only non-normative codec aspects are modified.
As future research, we plan to implement a complexity-controlled H.264/AVC encoder based on a test-skipping strategy. We also plan to benchmark video compression on massively parallel computing systems.

Fig. 4. Proposed parallel prediction structure (a) and the new encoder scheme for parallel prediction.
Fig. 16. Rate-distortion curves for HD sequence Rush Hour encoded according to IPBBBPBBP... GOP. OL stands for open-loop.

Fig. 23. Time savings vs. population size for different HD video sequences using UMHexS motion estimation.

Fig. 24. Average PSNR drop vs. population size for different HD video sequences using UMHexS motion estimation.

Fig. 25. Average rate increase vs. population size for different HD video sequences using UMHexS motion estimation.

TABLE I. RELATIVE COMPUTATIONAL COMPLEXITY FOR HD PEDESTRIAN AREA, INTRA-FRAME ONLY, IN JM12.3 H.264/AVC HIGH PROFILE CODING.