Real Applications
Several applications written using the RapidMind platform illustrate these performance gains. One example comes from RTT AG (www.rtt.ag), a provider of automotive visualization software. RTT's RealTrace software was built using RapidMind and is the world's first real-time raytracer for workstations. It displays accurate reflections and refractions on interactive models; see Figure 3. By using GPUs to perform the complicated computations involved in raytracing, RTT was able to bring to market a product that surpasses the state of the art in raytracing performance. The same code was ported to the Cell BE processor within three weeks and shown as a demonstration in IBM's booth at SIGGRAPH 2006.
With quad-core CPUs now available (and larger numbers of cores on the horizon), we are preparing for the pervasiveness of multicore systems. To this end, we recently demonstrated a prototype of a RapidMind x86 multicore backend on a financial modeling application; see Figure 4. Compared to hand-tuned C code compiled with the Intel C++ Compiler, the RapidMind version of the algorithm achieved twice the performance on a single core and eight times the performance on four cores, with no additional application effort. In other words, the single-core gain was multiplied by roughly the number of cores. The increased performance on a single core may come as a surprise. It stems from the fact that the semantics of the platform let our code generators perform much better analysis and optimization of the application code. We have been careful to design the system so that we are not plagued by issues (such as pointer aliasing) that inhibit effective optimization of C or C++ code. At the same time, our system integrates so cleanly with C++ that all the modularity constructs of the language are available to the developer, generally at no performance cost.
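To make the pointer-aliasing issue concrete, here is a minimal sketch in plain C++. It does not use the RapidMind API, and the function names are hypothetical, chosen only for illustration. In the first function the compiler generally has to assume that scale may point into data, so it reloads *scale on every iteration and cannot freely reorder or vectorize the loop. In the second function the value is passed by copy, so no store to data can change it.

```cpp
#include <cstddef>

// Hypothetical illustration, not RapidMind code.
// `scale` may alias an element of `data`, so each store to data[i]
// might change *scale; the compiler must re-read it every iteration.
void add_scale_ptr(float* data, std::size_t n, const float* scale) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] += *scale;
    }
}

// `scale` is a private copy: writes to `data` cannot affect it, so the
// value can stay in a register and the loop is easier to vectorize.
void add_scale_val(float* data, std::size_t n, float scale) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] += scale;
    }
}
```

The two functions also differ in behavior when aliasing actually occurs. Calling add_scale_ptr on the array {1, 2, 3} with scale pointing at its first element yields {2, 4, 5}, because the first store changes the value read by later iterations, whereas add_scale_val with the value 1 yields {2, 3, 4}. A compiler must preserve the first behavior unless it can prove the pointers never overlap, which is the kind of analysis a platform with value semantics avoids.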
Likewise, Hewlett-Packard and RapidMind compared CPU and GPU performance for scientific algorithms such as the Fast Fourier Transform (FFT) and the Basic Linear Algebra Subprograms (BLAS) single-precision matrix multiply (SGEMM). The results showed that the RapidMind implementations on a GPU were between 2.4 and 32.2 times as fast as the same algorithms running on a CPU core. For the CPU comparisons, highly tuned numerical libraries were used.