Our FFT, illustrated in Figure 3, has a 32-bit stream input, a 32-bit stream output, and two clocks, so it can be clocked at a different rate than the embedded processor with which it communicates. The algorithm itself is described in relatively straightforward, hardware-independent C code, with some minor C-level optimizations for increased parallelism and performance.

Figure 3. The FFT includes a 32-bit stream input, a 32-bit stream output, and two clocks, allowing the FFT to be clocked at a different rate than the embedded processor.
The FFT is a divide-and-conquer algorithm that is most easily expressed recursively. Recursion cannot be mapped to FPGA logic, however, so the algorithm must be implemented iteratively instead. In fact, almost all software implementations are also written iteratively (using a loop) for efficiency. Once the algorithm has been written as a loop, we can enable the automatic pipelining capabilities of the Impulse compiler.
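To make the recursive-to-iterative step concrete, here is a minimal sketch of a loop-based FFT. It is not the code used in this project: for brevity it is a radix-2 (not radix-4) decimation-in-time transform, the function name fft_iterative is hypothetical, and it assumes the caller supplies a twiddle-factor table, much as a hardware design might read twiddles from a lookup table.

```c
#include <assert.h>
#include <stddef.h>

/* In-place iterative radix-2 decimation-in-time FFT (illustrative sketch).
 * re, im : n complex samples, overwritten with the transform
 * wr, wi : twiddle table, wr[k] = cos(2*pi*k/n), wi[k] = -sin(2*pi*k/n),
 *          for k = 0 .. n/2 - 1 (assumed to be precomputed by the caller)
 * n      : transform size, must be a power of two */
void fft_iterative(float *re, float *im,
                   const float *wr, const float *wi, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);

    /* Step 1: bit-reversal permutation replaces the recursive splitting. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            float t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }

    /* Step 2: log2(n) butterfly stages, each one a plain loop nest. */
    for (size_t len = 2; len <= n; len <<= 1) {
        size_t half = len / 2;
        size_t step = n / len;          /* stride into the twiddle table */
        for (size_t base = 0; base < n; base += len) {
            for (size_t k = 0; k < half; k++) {
                size_t a = base + k;
                size_t b = a + half;
                float cr = wr[k * step];
                float ci = wi[k * step];
                /* t = twiddle * x[b] (complex multiply) */
                float tr = re[b] * cr - im[b] * ci;
                float ti = re[b] * ci + im[b] * cr;
                re[b] = re[a] - tr;
                im[b] = im[a] - ti;
                re[a] = re[a] + tr;
                im[a] = im[a] + ti;
            }
        }
    }
}
```

The innermost loop has a fixed body with no function calls, which is the general shape of loop that a pipelining compiler can work with. A radix-4 version follows the same pattern but processes four inputs per butterfly and needs half as many stages.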
Pipelining introduces a potentially high degree of parallelism in the generated logic, allowing us to achieve the best possible throughput. Our radix-4 FFT algorithm on 256 samples requires approximately 3,000 multiplications and 6,000 additions. Nonetheless, using the pipelining feature of Impulse C, we were able to generate hardware to compute the FFT in just 263 clock cycles.
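As a rough cross-check of those operation counts, the sketch below tallies real multiplications and additions for a textbook radix-4 FFT. The counting model is our assumption, not a description of the generated hardware: every butterfly is charged three complex twiddle multiplications (four real multiplications and two real additions each) plus eight complex additions, with no savings taken for trivial twiddle factors. The names op_count and radix4_op_count are hypothetical.

```c
#include <assert.h>

/* Real-operation totals for one transform. */
struct op_count {
    unsigned long mults;
    unsigned long adds;
};

/* Count real operations for an n-point radix-4 FFT, n a power of 4.
 * Assumed model: 3 complex multiplies + 8 complex adds per butterfly. */
struct op_count radix4_op_count(unsigned long n)
{
    struct op_count c;
    unsigned long stages = 0;

    assert(n >= 4);
    for (unsigned long m = n; m > 1; m /= 4) {
        assert(m % 4 == 0);             /* n must be a power of 4 */
        stages++;
    }

    unsigned long butterflies = stages * (n / 4);

    /* 3 complex multiplies x 4 real multiplies each */
    c.mults = butterflies * 3UL * 4UL;
    /* 3 complex multiplies x 2 real adds, plus 8 complex adds x 2 real adds */
    c.adds = butterflies * (3UL * 2UL + 8UL * 2UL);
    return c;
}
```

For n = 256 this model gives 4 stages of 64 butterflies, or 3,072 real multiplications and 5,632 real additions, which is in line with the approximate figures quoted above.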
We then integrated the resulting FFT hardware processing core into an embedded Linux (μClinux) application running on the Xilinx MicroBlaze soft-processor core. MicroBlaze μClinux is a free Linux-variant operating system ported at the University of Queensland and commercially supported by PetaLogix.
The software side of the application, running under the control of the operating system, interacts with the FFT through data streams to send and receive data and to initialize the hardware process. The streams themselves are defined using abstract communication methods provided in the Impulse C libraries. These include functions for opening and closing data streams and for reading and writing them. Other functions allow the size (width and depth) of each stream to be defined.
By using these functions on both the software and hardware sides of the application, it is easy to create applications in which hardware/software communication is abstracted through a software API. The Impulse compiler generates appropriate FIFO buffers and Fast Simplex Link (FSL) interconnections for the target platform, thereby saving you from the low-level hardware design that would otherwise be needed.
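The open/write/read/close pattern described above can be illustrated with a small software-only model of a bounded 32-bit FIFO. This is a sketch for intuition only: the toy_stream names and the fixed depth of 8 are hypothetical and are not the Impulse C API, and a real stream would block or stall rather than return a failure flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STREAM_DEPTH 8              /* assumed FIFO depth, in words */

/* Software model of a 32-bit wide, fixed-depth stream. */
typedef struct {
    uint32_t buf[TOY_STREAM_DEPTH];
    size_t head;                        /* next slot to read  */
    size_t tail;                        /* next slot to write */
    size_t count;                       /* words currently buffered */
    bool is_open;
} toy_stream;

void toy_stream_open(toy_stream *s)
{
    assert(s != NULL);
    s->head = s->tail = s->count = 0;
    s->is_open = true;
}

/* Producer side: returns false if the stream is closed or full. */
bool toy_stream_write(toy_stream *s, uint32_t word)
{
    if (!s->is_open || s->count == TOY_STREAM_DEPTH)
        return false;
    s->buf[s->tail] = word;
    s->tail = (s->tail + 1) % TOY_STREAM_DEPTH;
    s->count++;
    return true;
}

/* Consumer side: returns false when no data is available.
 * Buffered words can still be drained after the producer closes. */
bool toy_stream_read(toy_stream *s, uint32_t *word)
{
    if (s->count == 0)
        return false;
    *word = s->buf[s->head];
    s->head = (s->head + 1) % TOY_STREAM_DEPTH;
    s->count--;
    return true;
}

void toy_stream_close(toy_stream *s)
{
    s->is_open = false;
}
```

The point of the abstraction is that producer and consumer code looks the same whether the FIFO is a software buffer, as here, or a hardware FIFO generated by the compiler.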
Embedded Linux Integration
The default Impulse C tool flow targets a standalone MicroBlaze software system. In some applications, however, a fully featured operating system like μClinux is required. Advantages of embedded Linux include a familiar development environment (applications may be prototyped on desktop Linux machines), a feature-rich set of networking and file storage capabilities, a tremendous array of existing software, and no per-unit distribution royalties.
The μClinux (pronounced "you-see-Linux") operating system is a port of the open-source Linux version 2.4. The μClinux kernel is a compact operating system appropriate for a wide variety of 32-bit processor cores that lack a memory management unit (MMU). μClinux supports a huge range of microprocessor architectures, including the Xilinx MicroBlaze processor, and is deployed in millions of consumer and industrial embedded systems worldwide.
Integrating an Impulse C hardware core into μClinux is straightforward. The Impulse tools include support for μClinux and can automatically generate the required hardware/software interfaces, along with a makefile and the associated software libraries that implement the streaming and other functions mentioned previously. The Xilinx FSL hardware interface, combined with a freely available generic FSL device driver in the MicroBlaze μClinux kernel, makes connecting the software application to the Impulse C hardware accelerator relatively easy.
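From the application's point of view, a character-device driver of this kind reduces hardware access to ordinary file I/O. The sketch below shows one way to move whole 32-bit words through a file descriptor using POSIX read() and write(). The helper names fsl_write_words and fsl_read_words are hypothetical, and this is not the library code generated by the Impulse tools.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Write n 32-bit words to fd, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
int fsl_write_words(int fd, const uint32_t *words, size_t n)
{
    assert(words != NULL || n == 0);
    const unsigned char *p = (const unsigned char *)words;
    size_t left = n * sizeof *words;

    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        left -= (size_t)w;
    }
    return 0;
}

/* Read exactly n 32-bit words from fd.
 * Returns 0 on success, -1 on error or premature end of file. */
int fsl_read_words(int fd, uint32_t *words, size_t n)
{
    assert(words != NULL || n == 0);
    unsigned char *p = (unsigned char *)words;
    size_t left = n * sizeof *words;

    while (left > 0) {
        ssize_t r = read(fd, p, left);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            return -1;                  /* end of file before n words */
        p += r;
        left -= (size_t)r;
    }
    return 0;
}
```

On the target, fd would come from opening the FSL device node with open() in read/write mode. The node name depends on the kernel and driver configuration (a path such as /dev/fslfifo0 is our assumption here), so check the device nodes on your own system before relying on it.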