Comparative Benchmarks and Conclusions
Defining the GFFT object factory means instantiating all needed GFFT template classes during compilation, and each of them contains O(P) recursive template classes. A reasonable question is: how long will the compilation take? GNU C++ 4.x compiled the performance tests shown in Figures 1 and 2 in about 10 seconds with optimization flags, for the full range of values of P in the object factory, because the overall number of distinct templates to be instantiated stays small. Microsoft Visual C++ 2003 and Intel C++ 8.x and later showed similar results, although not all C++ compilers handle the recursive templates of the GFFT so efficiently. I suppose some older compilers try to instantiate template classes anew for each member function call, even when the template classes are the same, and there are O(2^P) such member function calls in the GFFT. As a result, the compilation hangs and finally fails once all memory has been exhausted. The unlucky examples are IBM C++ 6 (Visual Age) and GNU C++ 3.x, but not GNU C++ 2.96: they could compile the GFFT object factory quite quickly without optimization options, but hung with optimization turned on.
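To make the structure of the problem concrete, here is a minimal sketch, with assumed names rather than the actual GFFT sources, of the two points above: a single GFFT<P> pulls in O(P) recursively nested template classes, and the object factory instantiates GFFT<P> for every supported P at compile time. A compiler that reuses already-instantiated templates handles this cheaply; one that re-instantiates templates per member function call faces the exponential number of calls mentioned above.

```cpp
#include <complex>
#include <vector>

typedef std::complex<double> Complex;

// One class per recursion level: instantiating Recursion<P> also
// instantiates Recursion<P-1>, ..., Recursion<1>, i.e. O(P) classes.
template<unsigned P>
struct Recursion {
   Recursion<P-1> next;
   void apply(Complex* data) {
      next.apply(data);                    // two half-length transforms ...
      next.apply(data + (1u << (P-1)));
      // ... plus a combine pass (the butterflies are omitted in this sketch)
   }
};

template<>
struct Recursion<1> {                      // recursion anchor
   void apply(Complex*) { /* a length-2 butterfly would go here */ }
};

struct AbstractFFT {                       // runtime interface returned by the factory
   virtual void fft(Complex* data) = 0;
   virtual ~AbstractFFT() {}
};

template<unsigned P>
struct GFFT : public AbstractFFT {         // fully specialized transform of length 2^P
   Recursion<P> impl;
   void fft(Complex* data) { impl.apply(data); }
};

// "Object factory" initialization: this instantiates GFFT<1> ... GFFT<Max>
// during compilation, so every supported length is available at runtime.
template<unsigned Max>
struct FactoryInit {
   static void apply(std::vector<AbstractFFT*>& table) {
      FactoryInit<Max-1>::apply(table);
      table.push_back(new GFFT<Max>);
   }
};

template<>
struct FactoryInit<0> {
   static void apply(std::vector<AbstractFFT*>&) {}
};
```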
A nice property of the GFFT that distinguishes it from other well-known FFT libraries is that no runtime computation of roots of unity using sine and cosine functions is needed any more. Usual FFT implementations in known libraries such as FFTW or Intel MKL proceed in two steps:

1. Planning: an execution plan for the given transform length is set up, including the runtime computation of the roots of unity with sine and cosine calls.
2. Execution: the plan is applied to the data.
The second step can be repeated for different data of the same length. The GFFT omits the first step without any additional penalty. Figure 2 shows benchmarks of the GFFT compared with FFTW using the options FFTW_ESTIMATE and FFTW_MEASURE, as well as the routine zfft1dc from Intel MKL 7.0, where the total CPU time of both steps was measured. While very good for small P, the performance of FFTW becomes poor for large P. Only the hardware-optimized Intel MKL achieves similarly high performance to the GFFT. Qualitatively the same benchmark results were obtained on an Itanium 2 processor with the Intel C++ compiler.
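For readers unfamiliar with the two-step interface, the following is a minimal sketch using the FFTW3 C API; it illustrates only the pattern being benchmarked, not the benchmark code itself. The planning step is where a traditional library evaluates the roots of unity at runtime; the GFFT has no counterpart to it.

```cpp
#include <fftw3.h>

void transform(fftw_complex* data, int n)
{
   // Step 1: planning. The plan (including the runtime-computed roots of
   // unity) depends on the length n, not on the values stored in data.
   fftw_plan plan = fftw_plan_dft_1d(n, data, data, FFTW_FORWARD, FFTW_ESTIMATE);

   // Step 2: execution. This step can be repeated for any data of length n.
   fftw_execute(plan);

   fftw_destroy_plan(plan);

   // With the GFFT, a single call on the object obtained from the factory
   // replaces both steps (member function name assumed for illustration):
   //    gfft->fft(data);
}
```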
Finally, I would like to draw a general conclusion about the differences between the GFFT and traditional implementations. What makes the GFFT so efficient? The essential step was the assumption that P is a static constant. The reason was not simply to play with template metaprogramming, but to give additional information to the compiler, so that it could optimize the code better. Indeed, many numerical algorithms have integer parameters that vary only in a small range (like P in the FFT). Such a parameter can be assumed static and exploited as far as possible in template metaprogramming, taking care not to overload the compiler with the total number of template classes to be instantiated. The resulting implementation can be more efficient, because the compiler has more information and therefore more opportunities to optimize the code. You can use such an implementation directly with the static parameter, or compile it for all possible or needed parameter values within an object factory and use it as a library. The latter means you can choose highly optimized pieces of code without dynamic binding at runtime.
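As a generic illustration of this recommendation, independent of the FFT, consider a kernel whose small integer parameter becomes a template parameter; the names below are purely illustrative. The compiler sees a constant trip count and can unroll the loop, while a small table compiled for all supported values provides the "library" form of the same code: a single lookup selects the kernel, and the computation inside it contains no dynamic binding.

```cpp
// Illustrative sketch: a small-range integer parameter made static.
template<unsigned Order>
double evaluate(const double* c, double x)
{
   double r = c[Order];
   for (unsigned i = Order; i > 0; --i)    // trip count is a compile-time constant,
      r = r*x + c[i-1];                    // so the compiler can fully unroll the loop
   return r;
}

typedef double (*EvalFn)(const double*, double);

// "Library" form: every supported Order is compiled in advance,
// and the runtime choice is a single table lookup.
static const EvalFn eval_table[] = {
   &evaluate<0>, &evaluate<1>, &evaluate<2>, &evaluate<3>,
   &evaluate<4>, &evaluate<5>, &evaluate<6>, &evaluate<7>
};

double evaluate_runtime(unsigned order, const double* c, double x)
{
   return eval_table[order](c, x);
}
```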
This article described a C++ implementation technique using the simple, classical Cooley-Tukey FFT algorithm as an example. Of course, there are many more advanced and complicated FFT algorithms, such as the Winograd FFT, parallelized FFTs, and the like. These modern FFT approaches can also take advantage of template recursion and metaprogramming to improve performance significantly.