H.264 in Intel IPP
The most computationally intensive part of motion compensation in H.264 is generating the reference blocks. Since H.264 permits sub-pixel offsets from the actual data, the implementation must use a particular interpolation filter to calculate the blocks.
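To make the interpolation filter concrete, here is a minimal sketch of the half-sample luma filter defined by the H.264 standard, with taps (1, -5, 20, 20, -5, 1), and of the averaging step that produces a quarter-sample value. It is not Intel IPP code; the helper names ClipToByte, HalfPelHorizontal, and QuarterPelHorizontal are hypothetical, and only the horizontal case is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Clamp an intermediate value to the 8-bit sample range.
inline std::uint8_t ClipToByte(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// Half-sample value between row[x] and row[x + 1], using the H.264 luma
// taps (1, -5, 20, 20, -5, 1), rounding, and a right shift by 5.
// The caller must guarantee that row[x - 2] .. row[x + 3] are readable.
inline std::uint8_t HalfPelHorizontal(const std::uint8_t* row, int x) {
    assert(row != nullptr);
    const int sum = row[x - 2] - 5 * row[x - 1] + 20 * row[x]
                  + 20 * row[x + 1] - 5 * row[x + 2] + row[x + 3];
    return ClipToByte((sum + 16) >> 5);
}

// Quarter-sample value between the integer sample row[x] and the half
// sample to its right: the two neighbours are averaged with upward rounding.
inline std::uint8_t QuarterPelHorizontal(const std::uint8_t* row, int x) {
    return static_cast<std::uint8_t>((row[x] + HalfPelHorizontal(row, x) + 1) >> 1);
}
```

Each output sample reads six input samples, which is why a reference block needs a margin of extra pixels around it and why this step dominates the cost of motion compensation.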
Intel IPP defines a set of functions that handle interpolation at different locations in the image. The functions are the following:
- ippiInterpolateLuma_H264_[8u|16u]_C1R
- ippiInterpolateLumaTop_H264_[8u|16u]_C1R
- ippiInterpolateLumaBottom_H264_[8u|16u]_C1R
- ippiInterpolateLumaBlock_H264_[8u|16u]_C1R
- ippiInterpolateChroma_H264_[8u|16u]_C1R
- ippiInterpolateChromaTop_H264_[8u|16u]_C1R
- ippiInterpolateChromaBottom_H264_[8u|16u]_C1R
- ippiInterpolateChromaBlock_H264_[8u|16u]_C1R
These functions are divided into those handling the luma or brightness plane and those handling the chroma or color planes. They are also divided between functions that handle blocks for which all the data is present and functions that handle blocks on a frame boundary, outside which there is no data.
The functions for blocks that are not on the edge of a frame, ippiInterpolateLuma_H264 and ippiInterpolateChroma_H264, do not consider the integral portion of the motion vectors; they only perform the interpolation. The input pointer for the reference data should already point to the integral-offset reference block. The functions then calculate the interpolated reference block from the fractional part of the motion vector: 2 bits per component for luma at quarter-pixel resolution, or 3 bits per component for chroma at eighth-pixel resolution.
Of the other functions, those with Top or Bottom in the function name interpolate data at the edge of the image. The parameters tell them how far outside the image the reference block is. The function generates the missing data outside the image by replicating the border row, then performs the interpolation as usual.
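For the eighth-pixel case, the H.264 standard specifies a bilinear weighting of the four surrounding chroma samples. The sketch below shows that formula for a single sample. The function name ChromaSample is hypothetical, and this is not the Intel IPP implementation, which processes whole blocks.

```cpp
#include <cassert>
#include <cstdint>

// Bilinear chroma prediction for one sample at eighth-sample resolution.
// ref points at the integer-offset sample (the integral part of the motion
// vector has already been applied); xf and yf are the 3-bit fractions.
// The four samples ref[0], ref[1], ref[pitch], ref[pitch + 1] must be readable.
inline std::uint8_t ChromaSample(const std::uint8_t* ref, int pitch, int xf, int yf) {
    assert(xf >= 0 && xf < 8 && yf >= 0 && yf < 8);
    const int a = ref[0];
    const int b = ref[1];
    const int c = ref[pitch];
    const int d = ref[pitch + 1];
    // The four weights always sum to 64, so the result stays within 0..255.
    const int sum = (8 - xf) * (8 - yf) * a + xf * (8 - yf) * b
                  + (8 - xf) * yf * c + xf * yf * d;
    return static_cast<std::uint8_t>((sum + 32) >> 6);
}
```

With xf = 0 and yf = 0 the formula returns the integer-position sample unchanged, which matches the idea that the function handles only the fractional part of the vector.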
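The border-row replication described above can be sketched as a block fetch that clamps the source row index to the frame. This is only an illustration of the idea, assuming an 8-bit plane; the function name FetchWithRowReplication and its parameters are hypothetical, and the Intel IPP functions combine this step with the interpolation itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copy a width x height block starting at (x0, y0). Rows above the frame
// are replaced by row 0 and rows below it by the last row; y0 may be negative.
// Columns are assumed to lie inside the frame.
inline std::vector<std::uint8_t> FetchWithRowReplication(
        const std::uint8_t* frame, int pitch, int frameHeight,
        int x0, int y0, int width, int height) {
    assert(frame != nullptr && frameHeight > 0 && width > 0 && height > 0);
    std::vector<std::uint8_t> block(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        // Clamp the source row so out-of-frame rows reuse the border row.
        const int srcY = std::min(std::max(y0 + y, 0), frameHeight - 1);
        const std::uint8_t* srcRow = frame + srcY * pitch + x0;
        std::copy(srcRow, srcRow + width,
                  block.begin() + static_cast<std::ptrdiff_t>(y) * width);
    }
    return block;
}
```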
The remaining function type, that with Block in the function name, performs the interpolation on a reference block entirely within the image, but also takes the entire motion vector so that it can take care of the offset calculation. Listing 1 shows these functions in action.
The function SelectPredictionMethod determines whether the algorithm needs to employ the border versions of the functions. The rest of the code is from another, unspecified function.
The bulk of the function prepares all of the arguments to the interpolation functions. The variables mvx and mvy hold the complete motion vectors. This code sets the variables xh and yh to the low bits of the motion vector, the fractional portion. Then, after clipping the motion vectors to lie within a maximum range, the code sets the variables xint and yint to the integral portion of the motion vector. Finally, it calculates the pointer to the offset reference block and calls the appropriate Intel IPP function.
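The split into fractional and integral parts can be tried in isolation. The sketch below assumes quarter-pixel vectors, so the shift is 2 and the factor is 4. The names kInterpShift, kInterpFactor, SplitMv, SplitMotionVector, and OffsetReference are hypothetical stand-ins for the decoder's own macros and variables. It also assumes that right-shifting a negative integer is an arithmetic shift, which holds on common compilers and is guaranteed from C++20 onward.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kInterpShift = 2;                   // quarter-pixel luma vectors
constexpr int kInterpFactor = 1 << kInterpShift;  // 4 fractional positions

struct SplitMv {
    int whole;  // integral offset in samples (floor of mv / 4)
    int frac;   // fractional offset, 0..3, in quarter samples
};

// Split one motion vector component into its integral and fractional parts.
inline SplitMv SplitMotionVector(int mv) {
    SplitMv r;
    r.frac = mv & (kInterpFactor - 1);
    r.whole = mv >> kInterpShift;
    // The two parts always recombine into the original vector component.
    assert(r.whole * kInterpFactor + r.frac == mv);
    return r;
}

// Pointer to the integral-offset reference block.
inline const std::uint8_t* OffsetReference(const std::uint8_t* ref, int pitch,
                                           int xint, int yint) {
    return ref + xint + yint * pitch;
}
```

For a negative vector such as -5, the mask yields a fraction of 3 and the shift yields -2, so the fraction is always non-negative and the integral part rounds toward minus infinity.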
Note that the edge replication seems only to be an issue at the top and bottom and not the sides. This is because the replication at the top and bottom boundaries takes place at the macroblock level, but the left and right boundaries are replicated at the frame level.
    inline Ipp8s SelectPredictionMethod(Ipp32s MBYoffset,Ipp32s mvy,
                                        Ipp32s sbheight,Ipp32s height)
    {
        Ipp32s padded_y = (mvy&3)>0?3:0;
        mvy>>=2;
        if (mvy-padded_y+MBYoffset<0)
        {
            return PREDICTION_FROM_TOP;
        }
        if (mvy+padded_y+MBYoffset+sbheight>=height)
        {
            return PREDICTION_FROM_BOTTOM;
        }
        return ALLOK;
    }

    {
        ...
        // set pointers for this subblock
        pMV_sb = pMV + (xpos>>2) + (ypos>>2)*4;
        mvx = pMV_sb->mvx;
        mvy = pMV_sb->mvy;
        ...
        xh = mvx & (INTERP_FACTOR-1);
        yh = mvy & (INTERP_FACTOR-1);

        Ipp8u pred_method = 0;
        if (ABS(mvy) < (13 << INTERP_SHIFT))
        {
            if (is_need_check_expand)
            {
                pred_method = SelectPredictionMethod(
                    mbYOffset+ypos, mvy, roi.height, height);
            }
        }
        else
        {
            pred_method = SelectPredictionMethod(
                mbYOffset+ypos, mvy, roi.height, height);
            mvy = MIN(mvy, (height - ((Ipp32s)mbYOffset + ypos +
                roi.height - 1 - D_MV_CLIP_LIMIT))*INTERP_FACTOR);
            mvy = MAX(mvy, -((Ipp32s)(mbYOffset + ypos +
                D_MV_CLIP_LIMIT)*INTERP_FACTOR));
        }

        if (ABS(mvx) > (D_MV_CLIP_LIMIT << INTERP_SHIFT))
        {
            mvx = MIN(mvx, (width - ((Ipp32s)mbXOffset + xpos +
                roi.width - 1 - D_MV_CLIP_LIMIT))*INTERP_FACTOR);
            mvx = MAX(mvx, -((Ipp32s)(mbXOffset + xpos +
                D_MV_CLIP_LIMIT)*INTERP_FACTOR));
        }

        mvyc = mvy;
        xint = mvx >> INTERP_SHIFT;
        yint = mvy >> INTERP_SHIFT;
        pRef = pRefY_sb + xint + yint * pitch;

        switch(pred_method)
        {
        case ALLOK:
            ippiInterpolateLuma_H264_8u_C1R(pRef, pitch,
                pTmpY, nTmpPitch, xh, yh, roi);
            break;
        case PREDICTION_FROM_TOP:
            ippiInterpolateLumaTop_H264_8u_C1R(pRef, pitch,
                pTmpY, nTmpPitch, xh, yh,
                - ((Ipp32s)mbYOffset+ypos+yint),roi);
            break;
        case PREDICTION_FROM_BOTTOM:
            ippiInterpolateLumaBottom_H264_8u_C1R(pRef, pitch,
                pTmpY, nTmpPitch, xh, yh,
                ((Ipp32s)mbYOffset+ypos+yint+roi.height)- height,roi);
            break;
        default:VM_ASSERT(0);
            break;
        }
    }
Intra Prediction
The Intel IPP has three functions for prediction as applied to intra blocks. They are ippiPredictIntra_4x4_H264_8u_C1IR for 4x4 blocks, ippiPredictIntra_16x16_H264_8u_C1IR for 16x16 blocks, and ippiPredictIntraChroma8x8_H264_8u_C1IR for chroma blocks.
These functions take as arguments a pointer to the location of the block start and the buffer's step value, the prediction mode as in Listing 2, and a set of flags indicating which data blocks up or to the left are available. Listing 2 lists code using these functions to perform prediction.
There are three paths in this code: 16x16, 8x8, and 4x4. The 16x16 blocks call ippiPredictIntra_16x16 immediately. The 8x8 blocks call AddResidualAndPredict_8x8 and the 4x4 blocks call AddResidualAndPredict. The smaller blocks are organized into separate functions because they are relatively complicated: they involve many types of boundaries with other blocks, and a loop within the macroblock. Of these functions, only the 4x4 version is shown. The 8x8 version is nearly identical.
These prediction functions use a particular algorithm from the standard to calculate a reference block from previous blocks. The mode determines the direction of the data of interest, and then the algorithm calculates a prediction for each pixel based on an average of one or more available pixels in that direction.
This code takes the mode, already calculated elsewhere, as an argument. So the bulk of the code is dedicated to determining which outside reference blocks are available and calculating the block locations in memory. The border blocks are available if the predicted block is not on that border with another macroblock, or if the edge_type variable does not indicate that this macroblock is on a global (frame) edge. After calculating the predicted block, each of the two AddResidualAndPredict functions adds the residual using some flavor of motion compensation function starting with ippiMC, using full-pel resolution.
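As an illustration of how a mode turns neighboring pixels into a prediction, the sketch below implements two of the 4x4 luma modes from the H.264 standard: vertical, which copies the row above into every row, and DC, which averages whichever neighbors are available. The function names are hypothetical, and the left neighbors are assumed to be packed into a four-element array rather than read from the plane with a stride, so this is not a drop-in replacement for the Intel IPP functions.

```cpp
#include <cassert>
#include <cstdint>

// DC value for a 4x4 block. A null pointer marks an unavailable neighbour.
// top holds the 4 pixels above the block, left the 4 pixels to its left.
inline int DcValue4x4(const std::uint8_t* top, const std::uint8_t* left) {
    int sum = 0;
    for (int i = 0; top && i < 4; ++i) sum += top[i];
    for (int i = 0; left && i < 4; ++i) sum += left[i];
    if (top && left) return (sum + 4) >> 3;  // rounded mean of 8 pixels
    if (top || left) return (sum + 2) >> 2;  // rounded mean of 4 pixels
    return 128;                              // mid-grey for 8-bit samples
}

// DC mode: fill the block with the single DC value.
inline void PredictDc4x4(std::uint8_t* dst, int step,
                         const std::uint8_t* top, const std::uint8_t* left) {
    const std::uint8_t dc = static_cast<std::uint8_t>(DcValue4x4(top, left));
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            dst[y * step + x] = dc;
}

// Vertical mode: every row repeats the row above the block.
inline void PredictVertical4x4(std::uint8_t* dst, int step, const std::uint8_t* top) {
    assert(top != nullptr);  // vertical mode requires the upper neighbour
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            dst[y * step + x] = top[x];
}
```

The DC case shows why the availability flags matter: the same mode produces a different result depending on which neighbors exist.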
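The macroblock-level availability calculation amounts to inverting edge bits into availability bits. The sketch below shows that mapping on its own. The enumerators and their numeric values are placeholders chosen for this example and are not the values of the IPPVC_*_EDGE or IPP_UPPER-style constants defined in the Intel IPP headers.

```cpp
#include <cassert>

// Hypothetical edge flags: a set bit means the macroblock touches that edge.
enum EdgeBits : unsigned {
    kTopEdge = 1u, kLeftEdge = 2u, kTopLeftEdge = 4u, kTopRightEdge = 8u,
    kAllEdges = 15u
};

// Hypothetical availability flags: a set bit means the neighbour can be used.
enum AvailBits : unsigned {
    kUpper = 1u, kLeft = 2u, kUpperLeft = 4u, kUpperRight = 8u
};

// A neighbour is available exactly when the matching edge bit is clear.
inline unsigned AvailabilityFromEdges(unsigned edgeType) {
    assert((edgeType & ~static_cast<unsigned>(kAllEdges)) == 0);
    unsigned avail = 0;
    if (!(edgeType & kLeftEdge))     avail |= kLeft;
    if (!(edgeType & kTopLeftEdge))  avail |= kUpperLeft;
    if (!(edgeType & kTopRightEdge)) avail |= kUpperRight;
    if (!(edgeType & kTopEdge))      avail |= kUpper;
    return avail;
}
```

The decoder code expresses the same mapping arithmetically, multiplying each comparison result (0 or 1) by the corresponding flag and summing the products.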
    void AddResidualAndPredict(Ipp16s ** luma_ac,
                               Ipp8u * pSrcDstPlane,
                               Ipp32u step,
                               Ipp32u cbp4x4,
                               const IppIntra4x4PredMode_H264 *pMBIntraTypes,
                               Ipp32s edge_type,
                               bool is_half,
                               Ipp32s bit_depth)
    {
        Ipp32s srcDstStep = step;
        Ipp8u * pTmpDst = pSrcDstPlane;

        /* bit var to isolate cbp for block being decoded */
        Ipp32u uCBPMask = (1 << IPPVC_CBP_1ST_LUMA_AC_BITPOS);

        for (Ipp32s uBlock = 0; uBlock < (is_half ? 8 : 16);
             uBlock++, uCBPMask <<= 1)
        {
            pTmpDst = pSrcDstPlane;

            Ipp32s left_edge_subblock = left_edge_tab16[uBlock];
            Ipp32s top_edge_subblock = top_edge_tab16[uBlock];

            Ipp32s top = top_edge_subblock && (edge_type & IPPVC_TOP_EDGE);
            Ipp32s left = left_edge_subblock && (edge_type & IPPVC_LEFT_EDGE);
            Ipp32s top_left = ((top || left) && (uBlock != 0)) ||
                ((edge_type & IPPVC_TOP_LEFT_EDGE) && (uBlock == 0));
            Ipp32s top_right = (top && (uBlock != 5)) ||
                (!above_right_avail_4x4[uBlock]) ||
                ((edge_type & IPPVC_TOP_RIGHT_EDGE) && (uBlock == 5));

            Ipp32s avail = (left == 0)*IPP_LEFT +
                (top_left == 0)*IPP_UPPER_LEFT +
                (top_right == 0)*IPP_UPPER_RIGHT +
                (top == 0)*IPP_UPPER;

            ippiPredictIntra_4x4_H264_8u_C1IR(pTmpDst, srcDstStep,
                pMBIntraTypes[uBlock], avail);

            if ((cbp4x4 & uCBPMask) != 0)
            {
                const Ipp8u * pTmp = pSrcDstPlane;
                ippiMC4x4_8u_C1(pTmp, srcDstStep, *luma_ac, 8,
                    pSrcDstPlane, srcDstStep, IPPVC_MC_APX_FF, 0);
                *luma_ac += 16;
            }

            pSrcDstPlane += xyoff[uBlock][0] + xyoff[uBlock][1]*srcDstStep;
        }
    }

    {
        ...
        Ipp32s availability =
            ((edge_type & IPPVC_LEFT_EDGE) == 0)*IPP_LEFT +
            ((edge_type & IPPVC_TOP_LEFT_EDGE) == 0)*IPP_UPPER_LEFT +
            ((edge_type & IPPVC_TOP_RIGHT_EDGE) == 0)*IPP_UPPER_RIGHT +
            ((edge_type & IPPVC_TOP_EDGE) == 0)*IPP_UPPER;

        if (mbtype == MBTYPE_INTRA_16x16)
        {
            ippiPredictIntra_16x16(
                context->pYPlane + offsetY,
                rec_pitch_luma,
                (IppIntra16x16PredMode_H264) pMBIntraTypes[0],
                availability);

            if (luma_ac)
                AddResidual(luma_ac,
                    context->pYPlane + offsetY,
                    rec_pitch_luma,
                    sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
                    sd->bit_depth_luma);
        }
        else // if (intra16x16)
        {
            if (is_high_profile)
            {
                switch (special_MBAFF_case)
                {
                default:
                    if (pGetMB8x8TSFlag(sd->m_cur_mb.GlobalMacroblockInfo))
                    {
                        AddResidualAndPredict_8x8(
                            &luma_ac,
                            context->pYPlane + offsetY,
                            rec_pitch_luma,
                            sd->m_cur_mb.LocalMacroblockInfo->cbp,
                            (IppIntra8x8PredMode_H264 *) pMBIntraTypes,
                            edge_type_2t,
                            true,
                            sd->bit_depth_luma);
                        AddResidualAndPredict_8x8(
                            &luma_ac,
                            context->pYPlane + offsetY + 8*rec_pitch_luma,
                            rec_pitch_luma,
                            sd->m_cur_mb.LocalMacroblockInfo->cbp >> 2,
                            (IppIntra8x8PredMode_H264 *) pMBIntraTypes + 2,
                            edge_type_2b,
                            true,
                            sd->bit_depth_luma);
                    }
                    else
                    {
                        AddResidualAndPredict(
                            &luma_ac,
                            context->pYPlane + offsetY,
                            rec_pitch_luma,
                            sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
                            (IppIntra4x4PredMode_H264 *) pMBIntraTypes,
                            edge_type_2t,
                            true,
                            sd->bit_depth_luma);
                        AddResidualAndPredict(
                            &luma_ac,
                            context->pYPlane + offsetY + 8*rec_pitch_luma,
                            rec_pitch_luma,
                            sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma >> 8,
                            (IppIntra4x4PredMode_H264 *) pMBIntraTypes + 8,
                            edge_type_2b,
                            true,
                            sd->bit_depth_luma);
                    }
                    break;
                case 0:
                    if (pGetMB8x8TSFlag(sd->m_cur_mb.GlobalMacroblockInfo))
                    {
                        AddResidualAndPredict_8x8(
                            &luma_ac,
                            context->pYPlane + offsetY,
                            rec_pitch_luma,
                            sd->m_cur_mb.LocalMacroblockInfo->cbp,
                            (IppIntra8x8PredMode_H264 *) pMBIntraTypes,
                            edge_type,
                            false,
                            sd->bit_depth_luma);
                    }
                    else
                    {
                        AddResidualAndPredict(
                            &luma_ac,
                            context->pYPlane + offsetY,
                            rec_pitch_luma,
                            sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
                            (IppIntra4x4PredMode_H264 *) pMBIntraTypes,
                            edge_type,
                            false,
                            sd->bit_depth_luma);
                    }
                    break;
                }
            }
            else
            {
                switch (special_MBAFF_case)
                {
                default:
                    AddResidualAndPredict(
                        &luma_ac,
                        context->pYPlane + offsetY,
                        rec_pitch_luma,
                        sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
                        (IppIntra4x4PredMode_H264 *) pMBIntraTypes,
                        edge_type_2t,
                        true,
                        sd->bit_depth_luma);
                    AddResidualAndPredict(
                        &luma_ac,
                        context->pYPlane + offsetY + 8*rec_pitch_luma,
                        rec_pitch_luma,
                        sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma >> 8,
                        (IppIntra4x4PredMode_H264 *) pMBIntraTypes + 8,
                        edge_type_2b,
                        true,
                        sd->bit_depth_luma);
                    break;
                case 0:
                    AddResidualAndPredict(
                        &luma_ac,
                        context->pYPlane + offsetY,
                        rec_pitch_luma,
                        sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
                        (IppIntra4x4PredMode_H264 *) pMBIntraTypes,
                        edge_type,
                        false,
                        sd->bit_depth_luma);
                    break;
                }
            }
            ...
    }
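The residual step that follows each 4x4 prediction is, at full-pel resolution, a saturating add of the decoded residual to the predicted samples. The sketch below shows that operation under simplifying assumptions: the function name AddResidual4x4 is hypothetical, the residual stride is counted in elements rather than bytes, and no claim is made that the ippiMC functions are implemented this way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Add a 4x4 block of 16-bit residuals to 8-bit predicted samples in place,
// clamping each result to 0..255. step is the plane stride in bytes;
// residualStep is the residual stride in elements (4 for a packed block).
inline void AddResidual4x4(std::uint8_t* srcDst, int step,
                           const std::int16_t* residual, int residualStep) {
    assert(srcDst != nullptr && residual != nullptr);
    for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) {
            const int v = srcDst[y * step + x] + residual[y * residualStep + x];
            srcDst[y * step + x] =
                static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
        }
    }
}
```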