Transformation and Quantization
In the Intel IPP, transform and quantization functionality is merged for greater efficiency. There are four functions for H.264 decoding:
- ippiTransformDequantLumaDC_H264_16s_C1I
- ippiTransformDequantChromaDC_H264_16s_C1I
- ippiDequantTransformResidual_H264_16s_C1I
- ippiDequantTransformResidualAndAdd_H264_16s_C1I
There are analogous functions for encoding:
- ippiTransformQuantLumaDC_H264_16s_C1I
- ippiTransformQuantChromaDC_H264_16s_C1I
- ippiTransformQuantResidual_H264_16s_C1I
Additional functions handle 8x8 blocks.
Listing 3 shows a block of code from the H.264 decoder that uses these functions.
The cbp4x4 variable is a bitmask indicating whether any DC coefficients within the macroblock carry data and, individually, whether each residual (AC) block within the macroblock carries data. The QP variable holds the quantization parameter, which specifies the degree of quantization.
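For example, testing whether the macroblock carries any luma DC data, or whether the 4x4 block at index i carries AC data, reduces to bit tests against that mask. This is only an illustrative sketch; the constants are the ones used in Listing 3, and the index variable i is hypothetical:

/* Illustrative bit tests on cbp4x4, mirroring the masks used in Listing 3. */
Ipp32s hasLumaDC  = (cbp4x4 & IPPVC_CBP_LUMA_DC) != 0;
Ipp32s hasBlockAC = (cbp4x4 & (1 << (IPPVC_CBP_1ST_LUMA_AC_BITPOS + i))) != 0;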
If the bitmask indicates that there is any DC luma data, the code transforms and dequantizes it with the ippiTransformDequantLumaDC function. The code then iterates over the 16 blocks within the macroblock. For each block that has either DC data or residual data, the code transforms and dequantizes the block, passing in the decoded DC coefficient (which may be zero), the buffer of residual data, a flag indicating whether the residual data is valid, and the quantization parameter.
if ((cbp4x4 & (IPPVC_CBP_LUMA_AC | IPPVC_CBP_LUMA_DC)) != 0)
{
    Ipp16s *pDC;
    Ipp16s DCCoeff;
    Ipp16s *tmpbuf;
    /* bit var to isolate cbp for block being decoded */
    Ipp32u uCBPMask = (1 << IPPVC_CBP_1ST_LUMA_AC_BITPOS);

    if ((cbp4x4 & IPPVC_CBP_LUMA_DC) != 0)
    {
        luma_dc = (*ppSrcCoeff);
        *ppSrcCoeff += 16;
        ippiTransformDequantLumaDC_H264_16s_C1I(luma_dc, QP);
    }

    tmpbuf = 0;              /* init as no ac coeffs */
    pDC = 0;                 /* init as no dc */
    ac_coeffs = pDstCoeff;

    for (Ipp32s uBlock = 0; uBlock < 16; uBlock++, uCBPMask <<= 1)
    {
        DCCoeff = (Ipp16s) luma_dc[block_subblock_mapping[uBlock]];
        if (DCCoeff != 0)
            pDC = &DCCoeff;  /* dc coeff present */

        if ((cbp4x4 & uCBPMask) != 0)
        {
            /* copy this block's AC coefficients into the working buffer */
            memcpy(pDstCoeff, *ppSrcCoeff, 16 * sizeof(Ipp16s));
            tmpbuf = pDstCoeff;
            pDstCoeff += 16;
            *ppSrcCoeff += 16;
        }

        Ipp32s hasAC = (tmpbuf != 0);

        if (tmpbuf || pDC)
        {
            if (!pDC)
            {
                if (tmpbuf)
                {
                    if (dc_present)
                        tmpbuf[0] = 0;
                }
            }
            else
            {
                if (!tmpbuf)
                {
                    /* DC only: use the next output block as the working
                       buffer and mark this block as coded */
                    tmpbuf = pDstCoeff;
                    pDstCoeff += 16;
                    cbp4x4 |= uCBPMask;
                }
            }
            ippiDequantTransformResidual_H264_16s_C1I(tmpbuf, 8, pDC, hasAC, QP);
            tmpbuf = 0;
            pDC = 0;
        }
    }
}
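Stripped of the bookkeeping, the underlying flow for one intra-16x16 luma macroblock is a two-stage process. The following sketch is illustrative only: the buffer names dc and coeffs are hypothetical, the argument forms mirror the calls in Listing 3, and for simplicity the DC coefficient is always passed even when it is zero (the real code passes a null pointer in that case).

/* Hypothetical, simplified decode flow for one intra-16x16 luma macroblock. */
Ipp16s dc[16];           /* 4x4 block of luma DC coefficients, one per sub-block */
Ipp16s coeffs[16][16];   /* 16 AC coefficients for each of the 16 sub-blocks     */

/* Stage 1: inverse Hadamard transform and dequantization of the DC block. */
ippiTransformDequantLumaDC_H264_16s_C1I(dc, QP);

/* Stage 2: inverse transform and dequantization of each 4x4 residual,
   feeding in the dequantized DC coefficient for that sub-block. */
for (Ipp32s blk = 0; blk < 16; blk++)
{
    Ipp16s dcCoeff = dc[block_subblock_mapping[blk]];
    ippiDequantTransformResidual_H264_16s_C1I(coeffs[blk], 8, &dcCoeff,
                                              1 /* AC present */, QP);
}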
Deblocking Filter
The Intel IPP functions that perform filtering on the edges of macroblocks are divided according to edge direction (horizontal or vertical), luma or chroma data, block size, bit depth, and chroma sampling format. They are the following:
- ippiFilterDeblockingLuma_VerEdge_H264_[8u|16u]_C1IR
- ippiFilterDeblockingLuma_HorEdge_H264_[8u|16u]_C1IR
- ippiFilterDeblockingChroma_HorEdge[422|444]_H264_[8u|16u]_C1IR
- ippiFilterDeblockingChroma_VerEdge[422|444]_H264_[8u|16u]_C1IR
- ippiFilterDeblockingLuma_VerEdge_MBAFF_H264_[8u|16u]_C1IR
- ippiFilterDeblockingChroma_VerEdge_MBAFF_H264_[8u|16u]_C1IR
The MBAFF versions of the functions filter 16x8 blocks instead of 16x16 blocks and are intended for use with interlaced video.
Variations of some of these functions take a structure of parameters instead of pushing all of the parameters onto the stack. These variants provide a slight performance improvement because of decreased stack usage.
Listing 4 shows a code snippet that executes a deblocking filter. The behavior of the filters is determined by the alpha, beta, and clipping thresholds, and by the filter-strength array. The alpha parameter is the threshold for the gradient across an edge, while the beta parameter is the threshold for the gradient on one side of an edge. The clipping thresholds, held in the array Clipping and called tc0 in the standard, limit the effect of the filter. The threshold parameters come from fixed tables indexed by the quantization parameter (QP) plus a tuning offset, clipped to the range 0 to 51; for example, with QP equal to 28 and BetaOffset equal to zero, the index is 28 and Beta becomes BETA_TABLE[28], or 7. The strength parameter pStrength, referred to as bS in the standard, affects the deblocking filter in several ways, including the choice of the basic algorithm. Both the tables and the formulas used to calculate the indices are taken from the H.264 standard.
For simplicity, this code uses simple wrapper functions around each of the Intel IPP functions. The wrappers adapt the arguments and provide a uniform prototype for all of the deblocking filters, but do no computation of their own. Because they share a uniform prototype, the code calls them indirectly through a table of function pointers set up elsewhere.
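A minimal sketch of one such wrapper might look like the following. The argument order of the underlying ippiFilterDeblockingLuma_VerEdge_H264_8u_C1IR primitive is assumed here and should be checked against the Intel IPP reference; the bitDepth parameter exists only to keep the prototype uniform and is ignored by the 8-bit variant.

/* Illustrative wrapper; the ippi call's argument order is an assumption. */
IppStatus FilterDeblockingLuma_VerEdge(Ipp8u  *pSrcDst,
                                       Ipp32s  pitch,
                                       Ipp8u  *pAlpha,
                                       Ipp8u  *pBeta,
                                       Ipp8u  *pThresholds,
                                       Ipp8u  *pStrength,
                                       Ipp32s  bitDepth)
{
    (void) bitDepth;   /* always 8 for this variant */
    return ippiFilterDeblockingLuma_VerEdge_H264_8u_C1IR(pSrcDst, pitch,
                                                         pAlpha, pBeta,
                                                         pThresholds, pStrength);
}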
Ipp8u BETA_TABLE[52] =
{
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 6, 6, 7, 7, 8, 8,
    9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16,
    17, 17, 18, 18
};

...

{
    ...
    IppStatus (*(IppDeblocking[]))(Ipp8u *, Ipp32s, Ipp8u *, Ipp8u *,
                                   Ipp8u *, Ipp8u *, Ipp32s) =
    {
        &(FilterDeblockingLuma_VerEdge),
        &(FilterDeblockingLuma_HorEdge),
        &(FilterDeblockingChroma_VerEdge),
        &(FilterDeblockingChroma_HorEdge),
        &(FilterDeblockingChroma422_VerEdge),
        &(FilterDeblockingChroma422_HorEdge),
        &(FilterDeblockingChroma444_VerEdge),
        &(FilterDeblockingChroma444_HorEdge),
        &(FilterDeblockingLuma_VerEdge_MBAFF),
        &(FilterDeblockingChroma_VerEdge_MBAFF)
    };

    IppStatus (*(IppDeblocking16u[]))(Ipp16u *, Ipp32s, Ipp8u *, Ipp8u *,
                                      Ipp8u *, Ipp8u *, Ipp32s) =
    {
        &(FilterDeblockingLuma_VerEdge),
        &(FilterDeblockingLuma_HorEdge),
        &(FilterDeblockingChroma_VerEdge),
        &(FilterDeblockingChroma_HorEdge),
        &(FilterDeblockingChroma422_VerEdge),
        &(FilterDeblockingChroma422_HorEdge),
        &(FilterDeblockingChroma444_VerEdge),
        &(FilterDeblockingChroma444_HorEdge),
        &(FilterDeblockingLuma_VerEdge_MBAFF),
        &(FilterDeblockingChroma_VerEdge_MBAFF)
    };

    // internal edge variables
    QP = pmq_QP;

    index = IClip(0, 51, QP + BetaOffset);
    Beta[1] = (Ipp8u) (BETA_TABLE[index]);

    index = IClip(0, 51, QP + AlphaC0Offset);
    Alpha[1] = (Ipp8u) (ALPHA_TABLE[index]);
    pClipTab = CLIP_TAB[index];

    // create clipping values
    {
        Ipp32s edge;

        for (edge = 1; edge < 4; edge += 1)
        {
            if (*((Ipp32u *) (pStrength + edge * 4)))
            {
                // create clipping values
                Clipping[edge * 4 + 0] = (Ipp8u) (pClipTab[pStrength[edge * 4 + 0]]);
                Clipping[edge * 4 + 1] = (Ipp8u) (pClipTab[pStrength[edge * 4 + 1]]);
                Clipping[edge * 4 + 2] = (Ipp8u) (pClipTab[pStrength[edge * 4 + 2]]);
                Clipping[edge * 4 + 3] = (Ipp8u) (pClipTab[pStrength[edge * 4 + 3]]);
            }
        }
    }

    if (pParams->bitDepthLuma > 8)
    {
        IppDeblocking16u[dir]((Ipp16u *) pY, pic_pitch, Alpha, Beta,
                              Clipping, pStrength, pParams->bitDepthLuma);
    }
    else
    {
        IppDeblocking[dir](pY, pic_pitch, Alpha, Beta,
                           Clipping, pStrength, pParams->bitDepthLuma);
    }
}
Threading and Video Coding
H.264, and MPEG-4 coding in general, is amenable to threading. Listing 5 shows the key piece of code from the Intel IPP H.264 codec sample, which uses a single OpenMP pragma to parallelize the encoder.
The key aspect of this code is the slice. A slice is an independent segment of the picture, one that neither uses other slices for reference in prediction nor is used for reference by other slices. That makes it the ideal level for parallelization, because the codec can process multiple slices simultaneously without being serialized by prediction dependencies between them.
template <class PixType, class CoeffsType>
Status H264CoreEncoder<PixType, CoeffsType>::CompressFrame(
    EnumPicCodType &ePictureType,
    EnumPicClass   &ePic_Class,
    MediaData      *dst)
{
    Status status = UMC_OK;
    Ipp32s slice;

    for (m_field_index = 0;
         m_field_index <= (Ipp8u) (m_pCurrentFrame->m_PictureStructureForDec < FRM_STRUCTURE);
         m_field_index++)
    {
        ...
#if defined _OPENMP
        vm_thread_priority mainTreadPriority = vm_get_current_thread_priority();
#pragma omp parallel for private(slice)
#endif // _OPENMP
        for (slice = (Ipp32s) m_info.num_slices * m_field_index;
             slice < m_info.num_slices * (m_field_index + 1);
             slice++)
        {
#if defined _OPENMP
            vm_set_current_thread_priority(mainTreadPriority);
#endif // _OPENMP
            UpdateRefPicList(m_Slices + slice,
                             m_pCurrentFrame->GetRefPicLists(slice),
                             m_SliceHeader,
                             &m_ReorderInfoL0, &m_ReorderInfoL1);

            // Compress one slice
            if (m_is_cur_pic_afrm)
                m_Slices[slice].status = Compress_Slice_MBAFF(m_Slices + slice);
            else {
                m_Slices[slice].status =
                    Compress_Slice(m_Slices + slice,
                                   slice == m_info.num_slices * m_field_index);
            }
            ...
        }
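The essential pattern is simply a parallel loop over independent work items. The following standalone sketch strips away the codec details; SliceTask and EncodeSlice are hypothetical stand-ins for m_Slices[] and Compress_Slice() in the sample, and the program assumes it is compiled with OpenMP enabled.

#include <omp.h>
#include <cstdio>
#include <vector>

/* Hypothetical stand-in for one slice's encoding state (m_Slices[] in the sample). */
struct SliceTask
{
    int firstMB;   /* index of the first macroblock in the slice */
    int numMB;     /* number of macroblocks in the slice         */
    int status;    /* result of encoding this slice              */
};

/* Hypothetical stand-in for Compress_Slice(): encode one slice independently. */
static int EncodeSlice(SliceTask &s)
{
    /* ... predict, transform, quantize, and entropy-code s.numMB macroblocks ... */
    return 0;
}

int main()
{
    std::vector<SliceTask> slices(8);
    for (int i = 0; i < (int) slices.size(); i++)
    {
        slices[i].firstMB = i * 99;
        slices[i].numMB   = 99;
    }

    /* Because slices are independent, the iterations may run concurrently. */
    #pragma omp parallel for
    for (int i = 0; i < (int) slices.size(); i++)
        slices[i].status = EncodeSlice(slices[i]);

    printf("encoded %d slices on up to %d threads\n",
           (int) slices.size(), omp_get_max_threads());
    return 0;
}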