Max Fomitchev is the author of .NET Programming with Visual C++ and can be contacted at [email protected].
When developing numerical or signal-processing software, you eventually have to choose between using off-the-shelf libraries or writing your own routines. Popular off-the-shelf C/C++ math libraries include MathWorks' MATLAB C/C++ library, the Intel Performance Primitives (IPP) library, and Microsoft's Visual C++ Standard Template Library (STL), among others. While each has its advantages, selecting one over another usually comes down to trade-offs among performance, development time, and universality.
For instance, the idea of converting MATLAB M-files into C/C++ source files seems attractive. After conversion, MATLAB mathematical entities such as matrices, vectors, and arrays are represented as C++ objects. But the simplicity of objects comes at the price of lost performance, both through frequent memory allocation and through virtual-function-call and object-management overhead. On the other hand, converting to C yields code that is cumbersome and harder to work with. In both cases, the C/C++ library source code is locked away and not available for modification.
The situation is similar with the Intel Performance Primitives library. The library's performance is blazing; however, the unavailability of the source code makes it almost useless when interface or algorithm modifications are necessary. Part of the reason Intel opted not to include the source code with IPP is that most of it is hand-optimized x86 assembler peppered with MMX and SSE instructions that have been extensively reordered to achieve maximum processor pipeline throughput. Needless to say, working with such source code is a programmer's nightmare.
Microsoft's STL is included with every distribution of Visual C++. Although not really a math library, it does provide the complex and valarray templates for complex numbers and dynamic arrays, and it implements basic math operations, sorting, and iterator routines.
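For example, a few lines are enough to get complex arithmetic and dynamic arrays out of the STL (a simple illustration of standard usage, not part of the library developed in this article):

#include <complex>
#include <valarray>

void stl_example()
{
    // Dynamic array of 1024 complex numbers, zero-initialized
    std::valarray< std::complex<double> > a(1024);

    // Element-wise arithmetic comes for free
    a *= std::complex<double>(0.0, 1.0);   // multiply every element by i

    std::complex<double> z(3.0, -4.0);
    double magnitude = std::abs(z);        // 5.0
    (void)magnitude;
}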
Using off-the-shelf libraries is a good start, but developing your own routine pays off in the long run. For instance, while working on an ultrasound imaging software project I used IPP, mostly because of performance considerations. Still, as time went by, I found that it was becoming increasingly difficult to work around the library's limitations. Also, I wanted to cut corners in some places, but the unavailability of the source code stopped me from doing so.
With that in mind, I developed a set of generic routines that is both flexible and fast. While the need for high performance ruled out objects, the idea of using an archaic procedural approach did not seem attractive either. My approach was to use objects without object overhead: no virtual functions, no dynamic memory allocation, and inline operators. Another optimization trick was switching off stack-frame generation (the /Oy option in Visual C++), since stack frames can impair inline function performance (granted, this is compiler specific).
For instance, the COMPLEX class (struct) has only two data members:
struct COMPLEX
{
    double Re, Im;
};
and defines all the arithmetic operations as external inline operators:
__inline COMPLEX operator +(COMPLEX& A, COMPLEX& B)
{
    COMPLEX C;
    C.Re = A.Re + B.Re;
    C.Im = A.Im + B.Im;
    return C;
}
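A complex multiplication operator in the same style might look like this (my illustrative sketch; the library's actual definition may differ in details):

__inline COMPLEX operator *(COMPLEX& A, COMPLEX& B)
{
    COMPLEX C;
    C.Re = A.Re*B.Re - A.Im*B.Im;
    C.Im = A.Re*B.Im + A.Im*B.Re;
    return C;
}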
But there is more to these routines than just a nifty notation for basic operations on complex numbers. Since most of the operations are done on arrays or vectors of a known size, memory allocation can be dramatically reduced and certain operations can be done in-place.
In short, this approach adheres to the three golden rules of optimization:
- Compute only what you absolutely have to (that is, process only the data that changes).
- Dynamically allocate memory only when it is impossible to guess how much memory will be needed beforehand.
- Do not move data around unless you have a very good reason; do your operations in-place.
With this in mind, I present in this article a set of linear filtering routines, including Fourier transforms, convolution, and wavelet transformations. The C source code for these routines and a test program are available at http://www.cuj.com/code/.
Also, for versatility, you can utilize template representation for complex numbers and one-dimensional arrays:
template<class T> struct COMPLEX
{
    T Re, Im;
};

// ARRAY data type
template<class T, int N> struct ARRAY
{
    T data[N];
    T& operator [](int index) { return data[index]; }
    ARRAY<T, N>& operator =(const ARRAY<T, N>& a)
    {
        memcpy(data, a.data, sizeof(data));
        return *this;
    }
    T* getPtr() { return data; }
};

// COMPLEX ARRAY data type
template<class T, int N> struct CARRAY : public ARRAY<COMPLEX<T>, N>
{
};
These templates let you choose the desired data representation (double, float, int, and so on) based on the precise requirements of the task at hand. I developed a similar set of inline operators to support basic array operations, such as summation, multiplication, taking the complex conjugate, and the like.
Moreover, you can express a linear filtering operation in the frequency domain simply as S=F/B, where S, F, and B are CARRAY variables of a given precision and length. This notation is simple and clear, and performance remains high.
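As an illustration, an element-wise division operator along the following lines would support the S=F/B notation (a sketch assuming the CARRAY definition above; the actual library code may differ):

// Element-wise complex division: C[i] = A[i]/B[i]
template<class T, int N>
__inline CARRAY<T, N> operator /(CARRAY<T, N>& A, CARRAY<T, N>& B)
{
    CARRAY<T, N> C;
    for ( int i = 0; i < N; i++ )
    {
        T d = B[i].Re*B[i].Re + B[i].Im*B[i].Im;        // |B[i]|^2
        C[i].Re = (A[i].Re*B[i].Re + A[i].Im*B[i].Im)/d;
        C[i].Im = (A[i].Im*B[i].Re - A[i].Re*B[i].Im)/d;
    }
    return C;
}

With CARRAY<float, 1024> S, F, B, the expression S = F/B then works element by element.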
Linear Filtering
Filtering is the most basic operation of digital-signal processing. In a general sense, the term "filtering" embraces almost all operations that alter input data, so you must be more specific in what type of filtering is used. Here I focus on techniques for linear filtering and their implementation in C/C++.
The term "linear filtering" comes from the definition of a linear system and refers to any type of convolution (time domain) or frequency domain filtering. Figure 1 is an example linear filter.
Many image-processing operations (compression, feature detection/enhancement, noise reduction, and even wavelet transformation) rely on linear filtering. For example, in the frequency domain, the convolution of the functions s(t) and f(t) becomes a product of their Fourier transforms S and F:
s(t)*f(t) -> SF
or
s(t)*f(t) = iFT(FT(s)FT(f))
The only big difference between time-domain and frequency-domain filtering is the nature of the source data: whether amplitude-versus-time or amplitude-versus-frequency samples are given. If necessary, the data can be converted from one domain to the other to select the most appropriate (and least computationally expensive) filtering method.
In the frequency domain, the linear filtering operation is straightforward and corresponds to the multiplication of two arrays of complex numbers. The main computational load is handled by forward and inverse Fourier transform routines.
Fourier Transform
Figure 1 gives an efficient implementation of the fast forward and inverse Fourier transforms. It is convenient to implement these routines in C++ since complex math is made simple by means of operator overloading. Both the FFT() and iFFT() routines use precalculated sin/cos tables generated by the function FFTprep(). The implementation of iFFT() is particularly straightforward:
template<class T, int N>
ARRAY<COMPLEX<T>, N>& iFFT(ARRAY<COMPLEX<T>, N>& F)
{
    // Normalize
    scale(F, (T)1/N);
    return conj(FFT(conj(F)));
}
The data normalization in iFFT() is optional and can be moved to the FFT() routine. In that case, the normalization parameter 1/N must be replaced with sqrt(1/N), where N is the number of sampling points. For the best performance, it makes more sense to do the normalization once in the iFFT() routine rather than twice, in both FFT() and iFFT(). The conj() function computes an in-place complex conjugate of the source vector: f[i].Im=-f[i].Im.
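Based on that description, conj() and a matching scale() helper can be as simple as the following sketch; the hypothetical makeTwiddleTable() illustrates the kind of sin/cos table that FFTprep() precalculates (the actual FFT.cpp routines may differ):

#include <math.h>

// In-place complex conjugate; returns its argument to allow chained calls
template<class T, int N>
__inline ARRAY<COMPLEX<T>, N>& conj(ARRAY<COMPLEX<T>, N>& F)
{
    for ( int i = 0; i < N; i++ )
        F[i].Im = -F[i].Im;
    return F;
}

// In-place scaling by a constant factor (for example, the 1/N normalization)
template<class T, int N>
__inline ARRAY<COMPLEX<T>, N>& scale(ARRAY<COMPLEX<T>, N>& F, T factor)
{
    for ( int i = 0; i < N; i++ )
    {
        F[i].Re *= factor;
        F[i].Im *= factor;
    }
    return F;
}

// Hypothetical twiddle-factor table: W[k] = exp(-2*pi*i*k/n), k = 0..n/2-1
template<class T>
void makeTwiddleTable(COMPLEX<T>* W, int n)
{
    const double PI = 3.14159265358979;
    for ( int k = 0; k < n/2; k++ )
    {
        W[k].Re = (T) cos(2.0*PI*k/n);
        W[k].Im = (T)-sin(2.0*PI*k/n);
    }
}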
Most functions in the library return a reference to the source data, which makes elegant notation such as conj(FFT(conj(F))) possible.
Inverse Filtering
Inverse filtering is the simplest case of frequency-domain filtering. Consider a situation where the source signal S is distorted by a known linear filter F, producing the distorted signal S' (Figure 2). In this case, it is easy to reconstruct the original signal S by filtering S' with the pure inverse filter 1/F.
When calculating an inverse filter, it becomes necessary to compute 1/f. The trick is to avoid the seemingly inevitable division by zero by setting the result to zero wherever f[i] is zero:
for ( int i = 0; i < L; i++ )
    if ( f[i].Re != 0 || f[i].Im != 0 )
        f[i] = 1.0f/f[i];
    else
    {
        f[i].Re = 0;
        f[i].Im = 0;
    }
This approach avoids numerical exceptions. The calculated inverse filter exactly reconstructs the source signal S at frequencies where F[i] is nonzero. Where F[i] is zero, however, 1/F works like an ideal band-stop filter: if the source signal had no power in those bands, nothing is lost, but if S[i] is nonzero there, the recovery of the lost harmonics is impossible.
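Putting the pieces together, a pure inverse-filtering pass might look as follows (a sketch that assumes FFT() and iFFT() transform their argument in place and return a reference, as the iFFT() listing suggests; invF is assumed to hold the 1/F values computed by the loop above):

// Reconstruct the source signal from the distorted signal sd,
// given the precomputed inverse filter invF (frequency domain)
template<class T, int N>
CARRAY<T, N>& InverseFilter(CARRAY<T, N>& sd, CARRAY<T, N>& invF)
{
    FFT(sd);                          // to the frequency domain
    for ( int i = 0; i < N; i++ )     // apply 1/F element-wise
    {
        T re = sd[i].Re*invF[i].Re - sd[i].Im*invF[i].Im;
        T im = sd[i].Re*invF[i].Im + sd[i].Im*invF[i].Re;
        sd[i].Re = re;
        sd[i].Im = im;
    }
    iFFT(sd);                         // back to the time domain
    return sd;
}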
Wiener Filter
Generally, when the source signal is contaminated by random additive noise, pure inverse filters do a poor job of recovering the data. It is well known that in the presence of additive noise, Wiener filters are a better solution. For a linear system with transfer function G, signal power spectrum Ps, and noise power spectrum Pn, Example 1 defines the Wiener filter W. Without noise present (Pn=0), the Wiener filter exactly matches the pure inverse filter: W=1/G. The recovery of the time-domain signal s(t) from the contaminated signal f(t)->F can be expressed as in Example 2 or, using the routines presented here:
s = iFFT(FFT(f) *= (Ps/(Ps + Pn) /= FFT(g)));
This Wiener filter implementation takes advantage of in-place processing. Although the "+" and "/" operators do generate new temporary arrays (which are statically allocated on the stack), the "/=" and "*=" operators reuse them. Indeed, at least two temporary arrays are needed to hold intermediate results such as Ps+Pn and Ps/(Ps+Pn).
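The in-place operators that make this possible can be sketched as follows (my illustration, written against the ARRAY/COMPLEX templates above; the actual library definitions may differ):

// In-place element-wise complex multiplication: A[i] *= B[i]
template<class T, int N>
__inline ARRAY<COMPLEX<T>, N>& operator *=(ARRAY<COMPLEX<T>, N>& A, ARRAY<COMPLEX<T>, N>& B)
{
    for ( int i = 0; i < N; i++ )
    {
        T re = A[i].Re*B[i].Re - A[i].Im*B[i].Im;
        T im = A[i].Re*B[i].Im + A[i].Im*B[i].Re;
        A[i].Re = re;
        A[i].Im = im;
    }
    return A;
}

// In-place element-wise complex division: A[i] /= B[i]
template<class T, int N>
__inline ARRAY<COMPLEX<T>, N>& operator /=(ARRAY<COMPLEX<T>, N>& A, ARRAY<COMPLEX<T>, N>& B)
{
    for ( int i = 0; i < N; i++ )
    {
        T d  = B[i].Re*B[i].Re + B[i].Im*B[i].Im;        // |B[i]|^2
        T re = (A[i].Re*B[i].Re + A[i].Im*B[i].Im)/d;
        T im = (A[i].Im*B[i].Re - A[i].Re*B[i].Im)/d;
        A[i].Re = re;
        A[i].Im = im;
    }
    return A;
}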
Convolution Filtering
Convolution filtering is often used in place of frequency-domain filtering; see Example 3. In this example, the source data vector s of N elements is convolved with a filter f having M taps. The summation indices i and j are bounded by the source data vector length and the filter coefficient count.
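In code, the direct (time-domain) convolution of Example 3 can be written roughly as follows (a sketch; the boundary handling can also be made periodic, as in the wavelet routines later in the article):

// y[i] = sum over j of f[j]*s[i - j]; s has N samples, f has M taps,
// and y receives N + M - 1 samples
void Convolve(const float* s, int N, const float* f, int M, float* y)
{
    for ( int i = 0; i < N + M - 1; i++ )
    {
        float acc = 0.0f;
        for ( int j = 0; j < M; j++ )
        {
            int k = i - j;               // index into the source data
            if ( k >= 0 && k < N )
                acc += f[j]*s[k];
        }
        y[i] = acc;
    }
}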
Typically, a convolution filter is specifically designed for the task at hand. It can be a convolution analog of a band-pass/band-stop, Wiener, matched, or pure inverse filter.
Least-Squares Filters
When the desired output signal shape w(t) (that is, the shape of the signal after filtering) is known, you can construct a least-squares convolution filter f(t) by minimizing a cost function of the form in Example 4. Consider this example: You play music through a loudspeaker. The loudspeaker transfer function is usually known and given in a chart in the owner's manual. The loudspeaker works like a linear filter with impulse response d(t); that is, the inverse Fourier transform of the transfer function. Thus, to hear undistorted music, you must compute a least-squares filter f(t) for a desired output w(t) matching a delta function, so that f(t)*d(t) approximates that delta function. The music signal must be convolved with f(t) before it is supplied to the speaker. Then the loudspeaker plays undistorted music (Figure 3).
Example 4 leads to a special system of linear equations that can be efficiently solved by Toeplitz recursion [2]. The implementation of Toeplitz recursion is presented in LSF.cpp (http://www.cuj.com/code/), routine Toeplitz(). The right-hand side g[i] of the system is the cross-correlation of the desired output w(t) with d(t), and the matrix coefficients represent the autocorrelation of the system impulse response d(t):
for ( int k = 0; k + i < m; k++ )
{
    g[i] += d[k]*w[k + i];
    r[i] += d[k]*d[k + i];
}
Filter coefficients are calculated in M iterations, and in each iteration a new filter coefficient is produced.
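For reference, a Levinson-style recursion for the symmetric Toeplitz system R*f = g can be sketched as follows (my own sketch of the standard algorithm; the Toeplitz() routine in LSF.cpp may differ in interface and details):

// Solves the symmetric Toeplitz system R*f = g, where R is defined by its
// first row r[0..m-1] (autocorrelation) and g[0..m-1] is the right-hand side
// (cross-correlation). a[0..m-1] is a work array. Returns false on failure.
bool ToeplitzSolve(int m, const float* r, const float* g, float* f, float* a)
{
    if ( r[0] == 0.0f )
        return false;
    float v = r[0];          // prediction-error power
    a[0] = 1.0f;             // prediction-error (spiking) filter
    f[0] = g[0]/v;

    for ( int j = 1; j < m; j++ )
    {
        // Reflection coefficient for extending the prediction filter
        float e = 0.0f;
        for ( int i = 0; i < j; i++ )
            e += a[i]*r[j - i];
        float c = e/v;

        // Symmetric in-place update of the prediction filter
        a[j] = 0.0f;
        for ( int i = 0, k = j; i <= k; i++, k-- )
        {
            float ai = a[i] - c*a[k];
            float ak = a[k] - c*a[i];
            a[i] = ai;
            a[k] = ak;
        }
        v *= 1.0f - c*c;
        if ( v == 0.0f )
            return false;

        // Update the filter coefficients for the enlarged system
        float t = 0.0f;
        for ( int i = 0; i < j; i++ )
            t += f[i]*r[j - i];
        float q = (g[j] - t)/v;
        for ( int i = 0; i <= j; i++ )
            f[i] += q*a[j - i];
    }
    return true;
}

Each pass of the outer loop yields the solution of the next larger system, which matches the observation that a new filter coefficient appears at every iteration.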
Optimal-Lag Filters
Real-time processing generally requires the filter to be short. However, the quality of the output signal depends on the filter length: in general, the longer the filter, the better the output quality.
It was shown in [3] that simply time-shifting the desired output, w(t-l), can significantly improve filter quality. A lag filter is a filter calculated for the time-shifted desired output; an optimal-lag filter is the lag filter that gives the best output quality over all possible values of the lag l.
For instance, if you want to build a microcontroller for music-signal processing in the loudspeaker example, the filter length has to be fairly short, since hardly any three-dollar chip can handle 4096-tap, real-time filtering at 44.1 kHz. However, with optimal-lag filtering, you can find the best filter coefficients for a 128-tap filter, given the loudspeaker impulse response function.
Optimal-lag filter calculation is straightforward. A lag filter is computed for a range of lag values, and the filter quality is evaluated. The lag value and filter coefficients corresponding to the best filter quality are stored.
Function P() calculates the filter quality. Time-shifting of the desired output w(t) is effected by the indexing of the w-array: w+128-k. The constant 128, selected to limit the filter length, also defines the range of meaningful lag values: l = 0, 1, ..., 127. The Toeplitz(), OptimalFilter(), and P() routines can be optimized for a desired output matching a delta function (spike). In this case, g[i] can be computed as:
memset(g, 0, sizeof(g));
for ( int i = 0; i <= l; i++ )
    g[i] = d[l - i]*A;   // Spike height = A
and lag-filter quality is calculated as:
for ( int i = 0; i <= l; i++ )
    p += f[i]*d[l - i];
For use with the original routines, delta functions can be specified by a single nonzero value in the array w: w[0]=128, w[1]=w[2]=...=w[127]=0.
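The optimal-lag search described above amounts to a loop of the following shape (a sketch built on the ToeplitzSolve() sketch shown earlier; OptimalFilter() and P() in LSF.cpp implement the real thing, and a larger spike amplitude p is assumed to mean a better filter):

#include <string.h>

// d: impulse response with m samples (m >= 128 assumed), A: desired spike height
// fBest: receives the best 128-tap filter; the best lag is returned
int OptimalLagFilter(const float* d, int m, float A, float* fBest)
{
    float r[128], g[128], f[128], work[128];
    float bestP = -1.0f;
    int bestLag = 0;

    // Autocorrelation of the impulse response (the same for every lag)
    for ( int i = 0; i < 128; i++ )
    {
        r[i] = 0.0f;
        for ( int k = 0; k + i < m; k++ )
            r[i] += d[k]*d[k + i];
    }

    for ( int l = 0; l < 128; l++ )          // try every lag value
    {
        // Right-hand side for a spike of height A delayed by l samples
        memset(g, 0, sizeof(g));
        for ( int i = 0; i <= l; i++ )
            g[i] = d[l - i]*A;

        if ( !ToeplitzSolve(128, r, g, f, work) )
            continue;

        // Filter quality: spike amplitude actually achieved at lag l
        float p = 0.0f;
        for ( int i = 0; i <= l; i++ )
            p += f[i]*d[l - i];

        if ( p > bestP )
        {
            bestP = p;
            bestLag = l;
            memcpy(fBest, f, sizeof(f));
        }
    }
    return bestLag;
}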
Yet another application of least-squares and optimal-lag filters is ultrasound signal deconvolution. A map of reflections can be reconstructed from the ultrasound signal by filtering each rayline signal with an optimal-lag filter calculated for a desired output w(t) matching the delta function and d(t) matching the ultrasound transducer impulse response.
Wavelet Transformations
No filtering toolkit is complete without wavelets. Wavelets are closely related to convolution filters and let you analyze data according to scale. This is the core idea behind multiresolution analysis. With each analysis iteration (scale level), the source data is decomposed into a low-pass and a high-pass signal representing "smoothed" data and fine detail. The process is repeated recursively to produce representations of the source data in different complementary subspaces [4]. An overview of wavelets can be found in [5].
Figure 4 illustrates the wavelet analysis algorithm (also called the "forward wavelet transform"). With each iteration, the source data is convolved with the high-pass g[i] and low-pass h[i] analysis filters. In the process of convolution, the data is also decimated, so the output of each filter has half as many samples as the input.
The source data for the next iteration is the output fh(n) of the low-pass filter from the previous iteration. The analysis process ends when g[i] and h[i] filters produce a single output value.
An implementation of convolution filtering for wavelet analysis is illustrated by matrix multiplication (Figure 5). The high-pass and low-pass matrices have half as many rows as columns. Each row contains the filter coefficients translated by 2 to achieve downsampling by 2. Notice that the filter coefficients wrap around in the first rows of the low-pass matrix and in the last rows of the high-pass matrix. This makes the convolution filtering periodic, because the data is filtered as if it represented a periodic sequence of samples.
At each step of analysis, the dimensions of the matrices shrink by 2. The entire process of wavelet analysis can be expressed as a sequence of matrix products:
fh(i) = H(i) fh(i-1)
w(i) = G(i) fh(i-1),   i = 1, 2, ..., n

where fh(0) = f, f has 2^n samples, fh(i) has 2^(n-i) samples, and H(i) and G(i) are 2^(n-i) x 2^(n-i+1) matrices.
An efficient implementation of wavelet analysis (the forward wavelet transformation) operates according to Mallat's pyramidal algorithm and is given in FFT.cpp, routine FWT() (http://www.cuj.com/code/). The algorithm involves programming the convolution operations with careful indexing that ensures the source data indices wrap around correctly. The main convolution/decimation loop originally looked like this:
for ( int i = 0; i < N/2; i++ )
{
    float S1 = 0.0f, S2 = 0.0f;
    for ( int j = 0; j < M; j++ )
    {
        int k = (i*2 + j)%N;
        S1 += s[k - N]*g[j];     // G
        S2 += s[-1 - k]*h[j];    // H
    }
    // Store N-level wavelet coefficients
    *w++ = S1;
    // Store low-pass filtered downsampled signal
    s[N/2 - 1 - i] = S2;
}
where k=(i*2+j)%N ensures wraparound at N in the s-array indices. Notice that the low-pass filtered source data is stored in the source array at offset N, where N takes the values n, n + n/2, n + n/2 + n/4, ..., 2n - 1. Thus, you must allocate 2n elements for s. After the call to FWT(), the second half of the array s contains the smoothed and downsampled source data for all scale levels.
During the optimization process, mod-operations (%) were replaced by more efficient index calculations:
k++;
...
if ( k >= N )
    k = 0;
The last statement normally compiles into a conditional move instruction (CMOVGE in the x86 instruction set) and does not involve branching or the associated branch-misprediction penalties.
A 1024-point, 6-tap forward wavelet transform implemented according to this algorithm takes less than 350 µs to execute on a 300-MHz Pentium II with a 512-KB L2 cache.
Inverse Wavelet Transformation
The wavelet synthesis operation, often referred to as the "inverse wavelet transform," is similar to the forward wavelet transform but involves the transposed matrices H and G. Moreover, the inverse wavelet transform can use different low-pass and high-pass synthesis filters h[i] and g[i]. Different analysis/synthesis filters are typically used with biorthogonal wavelets [1]. Figure 6 illustrates the inverse wavelet transform.
The wavelet coefficients wn are convolved with the high-pass synthesis filter g*[i] and upsampled. The smoothed data fh(n) is convolved with the low-pass synthesis filter h*[i] and also upsampled. The two resulting data vectors are then added together to form the smoothed data fh(n-1) for the next synthesis step. Upsampling results from the structure of the filter matrices H* and G* (Figure 7), which have twice as many rows as columns, with each column containing the filter coefficients translated by 2. Typically, the matrices H* and G* are the transposes of H and G, unless biorthogonal wavelets are used.
An efficient implementation of the inverse wavelet transform is shown in FFT.cpp, routine iFWT(). It also involves periodic convolution. The given implementation uses a trick that avoids the complex index calculations of row-by-column matrix multiplication. Instead, column-by-column multiplication/addition is used. This is equivalent to convolving the data vector with the filter coefficients in a "vertical" manner. The results of the high-pass and low-pass vertical convolutions are then accumulated together.
The wavelet synthesis process uses the second half of the output array s for storing intermediate smoothed results. Thus, the s-array must be twice as large as the array of wavelet coefficients w; that is, it must contain 2n elements.
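One synthesis level of that kind can be sketched as follows (my illustration of the column-by-column idea; iFWT() in FFT.cpp uses a different index convention and manages the s-array storage itself):

#include <string.h>

// One level of wavelet synthesis with periodic wraparound.
// a: smoothed data (N/2 samples), wc: wavelet coefficients (N/2 samples),
// h, g: synthesis filters with M taps, out: reconstructed data (N samples)
void SynthesisStep(const float* a, const float* wc, int N,
                   const float* h, const float* g, int M, float* out)
{
    memset(out, 0, N*sizeof(float));
    for ( int i = 0; i < N/2; i++ )           // column-by-column accumulation
        for ( int j = 0; j < M; j++ )
        {
            int k = (2*i + j)%N;              // periodic index wraparound
            out[k] += a[i]*h[j] + wc[i]*g[j]; // upsampling by 2 is implicit
        }
}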
The matrices H and G in Figure 5 do not have a factor of 1/2 in front of them because the implementation of the wavelet transformation discussed in this article uses normalized wavelet coefficients. This means that each h[i] and g[i] is premultiplied by 1/sqrt(2). Thus, for the Haar transform, h[0]=h[1]=1/sqrt(2) and -g[0]=g[1]=1/sqrt(2), rather than all 1s. Table 1 contains normalized wavelet coefficients for the commonly used D4 and D6 wavelets.
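For example, the normalized Haar coefficients, and the commonly cited normalized D4 (Daubechies-4) coefficients, can be written down directly (sign and ordering conventions vary between texts, so these may differ from Table 1):

#include <math.h>

// Normalized Haar analysis filters
static const double HaarH[2] = {  1.0/sqrt(2.0), 1.0/sqrt(2.0) };
static const double HaarG[2] = { -1.0/sqrt(2.0), 1.0/sqrt(2.0) };

// Commonly cited normalized D4 low-pass coefficients
static const double D4H[4] = {
    (1.0 + sqrt(3.0))/(4.0*sqrt(2.0)),   //  0.4829629...
    (3.0 + sqrt(3.0))/(4.0*sqrt(2.0)),   //  0.8365163...
    (3.0 - sqrt(3.0))/(4.0*sqrt(2.0)),   //  0.2241438...
    (1.0 - sqrt(3.0))/(4.0*sqrt(2.0))    // -0.1294095...
};

// High-pass filter by the same quadrature-mirror rule used for Haar above:
// g[j] = (-1)^(j+1) * h[M - 1 - j] (the overall sign is a matter of convention)
static const double D4G[4] = { -D4H[3], D4H[2], -D4H[1], D4H[0] };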
Forward and inverse wavelet transformations are usually used together. Typical applications consist of an analysis step, wavelet coefficient modification, and a synthesis step [5]. For noise/speckle reduction or compression, wavelet coefficients are either discarded, thresholded, or quantized. The key to such an algorithm's success is the search for the best possible analysis/synthesis filters. An entropy-based best-basis search technique is presented in [4], along with some useful pseudocode routines related to linear filtering.
Conclusion
The linear filtering techniques presented here form the basis of a digital-signal processing toolkit. The routines are optimized for speed. It took 640 µs to execute a 1024-point FFT() on a 300-MHz Pentium II machine with a 512-KB L2 cache, while a 1024-point, 6-tap FWT() took only 300 µs. In short, these signal-processing routines are an efficient and portable implementation of key linear filtering algorithms.
Finally, for performance-critical applications, it is better to use compilers that employ vectorization techniques to enable SIMD processing. The Intel C++ compiler (http://www.intel.com/software/products/compilers/cwin/) and the Metrowerks CodeWarrior C/C++ compiler (http://www.metrowerks.com/MW/Develop/CodeWarrior) use such vectorization techniques to generate Streaming SIMD and 3DNow! instructions for optimal floating-point performance. However, vectorization of source code targeted for Streaming SIMD/3DNow!-enabled processors is possible only when 32-bit floating-point numbers are used. This limitation stems from the fact that the Streaming SIMD/3DNow! instruction sets can perform up to four parallel floating-point operations only on 32-bit single-precision numbers.
References
[1] Castleman, K. Digital Image Processing, Prentice Hall, 1996.
[2] Robinson, E. and S. Treitel. Geophysical Signal Analysis, Prentice-Hall, 1980, pp. 146-149, 163-169.
[3] Fomitchev, M., Y. Grigorashvily, and S. Volkov. "Ultrasonic Pulse Shaping with Optimal Lag Filters," International Journal of Imaging Systems and Technology, vol. 5, 1999.
[4] Wickerhauser, M. Adapted Wavelet Analysis from Theory to Software, IEEE Press, 1993.
[5] Fomitchev, M. "An Introduction to Wavelets and Wavelet Transforms," Mathematical Morphology, Smolensk, vol. 11, 1998 (http://www.smolensk.ru/user/sgma/MMORPH/N-4-html/1.htm).
[6] Fomitchev, M. "Opening GUI Windows from Console Applications," Visual C++ Developer Journal, vol. 11, 1999.