The two video codec naming schemes, H.26x and MPEG-x, overlap. MPEG-2 is named H.262 in the H.26x scheme. Likewise, another popular codec, H.264, is part of the MPEG-4 standard and is also known as MPEG-4 Advanced Video Coding (AVC). Its intent, like that of all of MPEG-4, was to produce video compression of acceptable quality at a very low bit rate, around half that of its predecessors MPEG-2 and H.263.
Like its predecessors in the H.26x video codec family, H.264 has two encoding modes for individual video frames -- intra and inter. In the former, a frame of video is encoded as a stand-alone image without reference to other images in the sequence. In the latter, previously decoded and possibly future frames are used to predict the pixel values. Figure 1 shows the high-level blocks involved in intra-frame encoding and decoding of H.264. Figure 2 shows the encoding and decoding process for inter frames.
In both intra and inter frames, blocks in H.264 can be expressed relative to other, previously coded blocks. In inter frames, prediction from blocks in other frames is called motion compensation, and the encoder's search for the best-matching reference block is called motion estimation. This is the source of considerable compression. As with other video compression techniques, it exploits the fact that the difference between two similar blocks has considerably less entropy than the absolute values of either block. This is particularly true when the reference block can be constructed at an arbitrary offset within another frame.
H.264 has very flexible support for motion estimation. The encoder can choose among as many as 16 reference frames (32 reference fields when field coding is used), and it is allowed to refer to blocks that have to be constructed by interpolation.
The encoder is responsible for choosing a reference image, a reference block, and a motion vector. The block is generally chosen by searching among the possibilities, starting with the most likely candidates. The encoder then calculates and encodes the difference (the residual) between the predicted block and the actual block data.
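To make the encoder's task concrete, here is a minimal sketch, not drawn from any particular implementation, of a brute-force integer-pixel search over a single reference frame. It uses the sum of absolute differences as the matching cost and produces the residual that would then be transformed and encoded. The function name and parameters are hypothetical, and the caller is assumed to keep the search window inside the (possibly padded) frame. Real encoders use far smarter search orders and cost functions, and they also search at sub-pixel positions, as described below.

#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* Hypothetical full-search motion estimation over integer-pixel offsets.
 * cur and ref are full frames stored row-major with the given stride.
 * The best (dx, dy) within +/-range is returned through best_dx/best_dy,
 * and the residual (current block minus predicted block) is written out.
 * The caller must ensure the search window stays inside the frame. */
static void full_search_16x16(const uint8_t *cur, const uint8_t *ref,
                              int stride, int x, int y, int range,
                              int *best_dx, int *best_dy,
                              int16_t residual[16 * 16])
{
    int best_cost = INT_MAX;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int cost = 0;
            for (int j = 0; j < 16; j++)
                for (int i = 0; i < 16; i++)
                    cost += abs(cur[(y + j) * stride + x + i] -
                                ref[(y + dy + j) * stride + x + dx + i]);
            if (cost < best_cost) {
                best_cost = cost;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
    /* Residual relative to the best-matching reference block. */
    for (int j = 0; j < 16; j++)
        for (int i = 0; i < 16; i++)
            residual[j * 16 + i] =
                (int16_t)(cur[(y + j) * stride + x + i] -
                          ref[(y + *best_dy + j) * stride + x + *best_dx + i]);
}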
On the decoding end, after the reference blocks have been decoded, the decoder adds the reference data to the decoded difference data. Blocks and frames are often decoded out of display order, because frames can be predicted from blocks and frames that come later in display order.
H.264 encoding supports sub-pixel resolution for motion vectors, meaning that the reference block is actually calculated by interpolating within a block of real pixels. The motion vectors for luma blocks are expressed at quarter-pixel resolution, and those for chroma blocks at up to eighth-pixel resolution.
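Because luma vectors are stored in quarter-pixel units, a decoder typically splits each vector component into an integer part, which addresses real pixels, and a fractional part, which selects the interpolation case. The following is a minimal sketch of that split, assuming quarter-pixel luma vectors and 4:2:0 chroma, where the same vector is reinterpreted at eighth-pixel chroma resolution; the function names are hypothetical.

/* Split a quarter-pel luma motion-vector component into integer and
 * fractional parts; the fractional part (0..3) selects one of the
 * luma sub-pixel positions together with the other component. */
static inline void split_luma_mv(int mv, int *int_part, int *frac_part)
{
    *int_part  = mv >> 2;   /* integer-pixel offset */
    *frac_part = mv & 3;    /* 0, 1/4, 1/2 or 3/4 pixel */
}

/* For 4:2:0 chroma the same quarter-pel luma vector is reused on the
 * half-resolution chroma grid, i.e. at eighth-pel resolution, so the
 * split uses three fractional bits instead of two. */
static inline void split_chroma_mv(int mv, int *int_part, int *frac_part)
{
    *int_part  = mv >> 3;
    *frac_part = mv & 7;
}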
This sub-pixel resolution increases the algorithmic and computational complexity significantly. The decoding portion, which requires performing sub-pixel motion compensation only once per block, takes about 10 to 20 percent of the decoding pipeline. The bulk of this time is spent interpolating values between pixels to generate the sub-pixel-offset reference blocks. On the encoding side, the cost of sub-pixel estimation varies with the search algorithm, but it may require performing motion compensation more than once per block.
The interpolation algorithm to generate offset reference blocks is defined differently for luma and chroma blocks. For luma, interpolation is performed in two steps, half-pixel and then quarter-pixel interpolation. The half-pixel values are created by filtering with this kernel horizontally and vertically:
[1 -5 20 20 -5 1]/32
Quarter-pixel interpolation is then performed by averaging the two nearest values at integer- or half-pixel positions.
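A minimal sketch of the one-dimensional luma case follows (horizontal filtering only; the two-dimensional half-pixel positions apply the same filter again vertically to intermediate results, which is omitted here). The helper names are hypothetical; the rounding and clipping follow the kernel above.

#include <stdint.h>

static inline uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal half-pixel sample between p[0] and p[1], using the
 * six-tap kernel [1 -5 20 20 -5 1]/32 with rounding.  p must have at
 * least two valid samples to its left and three to its right. */
static inline uint8_t luma_half_pel_h(const uint8_t *p)
{
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
    return clip255((v + 16) >> 5);
}

/* Quarter-pixel samples are the rounded average of the two nearest
 * integer- or half-pixel samples. */
static inline uint8_t luma_quarter_pel(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}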
Motion compensation for chroma blocks uses bilinear interpolation with quarter-pixel or eighth-pixel accuracy, depending on the chroma format. Each sub-pixel position is a weighted linear combination of the four neighboring integer pixels.
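For 4:2:0 content, each chroma prediction sample is therefore a blend of the four surrounding integer chroma samples, weighted by the eighth-pixel fractional offsets. Here is a sketch of that per-sample computation; the function name is hypothetical.

#include <stdint.h>

/* Bilinear chroma interpolation for one sample.  a and b are the two
 * integer samples above the target position, c and d the two below;
 * xfrac and yfrac are the fractional offsets in eighth-pixel units (0..7). */
static inline uint8_t chroma_bilinear(uint8_t a, uint8_t b,
                                      uint8_t c, uint8_t d,
                                      int xfrac, int yfrac)
{
    int v = (8 - xfrac) * (8 - yfrac) * a +
            xfrac       * (8 - yfrac) * b +
            (8 - xfrac) * yfrac       * c +
            xfrac       * yfrac       * d;
    return (uint8_t)((v + 32) >> 6);
}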
Figure 3 illustrates which pixels are used in each of these interpolation approaches.
After interpolating to generate the reference block, the algorithm adds that reference block to the decoded difference information to get the reconstructed block. The encoder executes this step to get reconstructed reference frames, and the decoder executes this step to get the output frames.
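Conceptually, this reconstruction step is just a clipped addition of prediction and residual, as in the following hypothetical sketch for a single 4x4 block.

#include <stdint.h>

/* Reconstruct one 4x4 block: add the decoded residual to the
 * (possibly interpolated) prediction and clip to the 8-bit range. */
static void reconstruct_4x4(const uint8_t pred[16], const int16_t residual[16],
                            uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        int v = pred[i] + residual[i];
        out[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}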