In this article, we explore code acceleration and techniques for converting code into hardware coprocessors. We also demonstrate how to make trade-off decisions using benchmark data, through an actual image-rendering case study built on an auxiliary processor unit (APU)-based technique. The design uses an immersed PowerPC implemented in a platform FPGA.
The value of a coprocessor
A coprocessor is a processing element used alongside a primary processing unit to offload computations that the primary unit would otherwise perform. Typically, the coprocessor function implemented in hardware replaces several software instructions. Code acceleration is thus achieved both by reducing multiple instructions to a single instruction and by executing that instruction directly in hardware.
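To make this concrete, here is a minimal sketch of the idea. The software multiply-accumulate loop below compiles to several instructions per element; a coprocessor could implement the same operation as a single custom instruction executed directly in hardware. The mac_coproc() wrapper is a hypothetical name used only for illustration, not an actual API.

```c
/* Illustrative only. In software, a multiply-accumulate takes several
 * instructions per element (load, multiply, add, branch). A coprocessor
 * can collapse the operation into one custom instruction executed
 * directly in hardware. */

/* Software version: the loop body compiles to several instructions. */
long mac_sw(const short *a, const short *b, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc += (long)a[i] * b[i];
    return acc;
}

/* Coprocessor version: the same work is issued as one custom instruction
 * (or one instruction per block), through whatever intrinsic or inline
 * assembly the tool flow provides. Hypothetical declaration for illustration. */
extern long mac_coproc(const short *a, const short *b, int n);
```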
The most frequently used coprocessor is the floating-point unit (FPU), the only common coprocessor that is tightly coupled to the CPU. There are no general-purpose libraries of coprocessors, and even if there were, it would still be difficult to couple a coprocessor closely to a CPU such as a Pentium 4.
As shown in Fig 1, the Xilinx Virtex-4 FX FPGA has one or two PowerPCs, each with an APU interface. By embedding a processor within an FPGA, you now have the opportunity to implement complete processing systems of your own design within a single chip.
1. Virtex-4 FX processor with APU interface and EMAC blocks.
The integrated PowerPC with APU interface enables a tightly coupled coprocessor implemented within the FPGA. Clock-frequency requirements and pin-count limits make an external coprocessor far less capable. You can therefore create application-specific coprocessors attached directly to the PowerPC, providing significant software acceleration. And because FPGAs are reprogrammable, you can rapidly develop and test CPU-attached coprocessor solutions.
Coprocessor connection models
Coprocessors are available in three basic forms: CPU bus-connected, I/O-connected, and instruction-pipeline-connected. Mixed variants also exist.
- CPU Bus Connected: Processor bus-connected accelerators require the CPU to move data and send commands through a bus. Typically, a single data transaction can require many processor cycles. Data transactions can be hindered by bus arbitration and the necessity for the bus to be clocked at a fraction of the processor clock speed. A bus-connected accelerator can include a direct memory access (DMA) engine. At the cost of additional logic, the DMA engine allows a coprocessor to operate on blocks of data located on bus-connected memory, independent of the CPU.
- I/O Connection: I/O-connected accelerators are attached directly to a dedicated I/O port. Data and control are typically provided through GET or PUT functions (a brief code sketch appears after this discussion of connection models). With no arbitration, simpler control, and fewer attached devices, these interfaces can typically be clocked faster than a processor bus. A good example of such an interface is the Xilinx Fast Simplex Link (FSL). The FSL is a simple FIFO interface that can be attached to either the Xilinx MicroBlaze soft-core processor or a Virtex-4 FX PowerPC. Data movement through the FSL has lower latency and a higher data rate than data movement through a processor bus interface.
- Instruction Pipeline Connection: Instruction-pipeline-connected accelerators attach directly to the computing core of a CPU. Because the coprocessor is coupled to the instruction pipeline, instructions not recognized by the CPU can be passed to the coprocessor for execution. Operands, results, and status are passed directly to and from the data execution pipeline. A single operation can result in two operands being processed, with both a result and status being returned.
Because it is directly connected, an instruction-pipeline interface can be clocked faster than a processor bus. The Xilinx implementation of this connection model, the APU interface, demonstrates a 10x reduction in the clock cycles needed to control and move data for a typical double-operand instruction. The APU controller is also connected to the data-cache controller and can perform data load/store operations through it. Thus, the APU interface can move hundreds of millions of bytes per second, approaching DMA speeds.
Either I/O-connected accelerators or instruction-pipeline-connected accelerators can be combined with bus-connected accelerators. At the cost of additional logic, you can create an accelerator that receives commands and returns status through a fast, low-latency interface while operating on blocks of data located in bus-connected memory.
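As an illustration of the I/O-connected model mentioned above, the following sketch shows how application code might move data through a FIFO-style link such as the FSL. The putfsl/getfsl macros follow the Xilinx MicroBlaze convention; the exact header and macro names vary with the tool version, so treat them as assumptions rather than a definitive interface.

```c
/* Minimal sketch of I/O-connected (FIFO-style) accelerator access.
 * On MicroBlaze, Xilinx provides blocking put/get macros for the FSL;
 * the header and macro names (fsl.h, putfsl, getfsl) depend on the
 * tool version, so treat them as assumptions. */
#include "fsl.h"

unsigned int accel_transform(unsigned int sample)
{
    unsigned int result;

    putfsl(sample, 0);   /* blocking write to the accelerator on FSL channel 0 */
    getfsl(result, 0);   /* blocking read of the result from FSL channel 0 */

    return result;
}
```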
The C-to-HDL tool set described in this article is capable of implementing bus-connected and I/O-connected accelerators. It is also capable of implementing an accelerator connected to the APU interface of the PowerPC. Although the APU connection is instruction-pipeline-based, the C-to-HDL tool set implements an I/O pipeline interface with a resulting behavior more typical of an I/O-connected accelerator.
FPGA / PowerPC / APU interface
FPGAs allow hardware designers to implement a complete computing system with processor, decode logic, peripherals, and coprocessors all on one chip. An FPGA can contain a few thousand to hundreds of thousands of logic cells. A processor can be implemented from the logic cells, as in the Xilinx PicoBlaze or MicroBlaze processors, or it can be one or more hard logic elements, as in the Virtex-4 FX PowerPC. The high number of logic cells enables you to implement data-processing elements that work with the processor system and are controlled or monitored by the processor.
FPGAs, being reprogrammable elements, allow you to program parts and test them at any stage during the design process. If you find a design flaw, you can immediately reprogram a part. FPGAs also allow you to implement hardware computing functions that were previously cost-prohibitive. The tight coupling of a CPU pipeline to FPGA logic, as in the Virtex-4 FX PowerPC, enables you to create high-performance software accelerators.
A block diagram showing the PowerPC, integrated APU controller, and an attached coprocessor is shown in Fig 2. Instructions from cache or memory are simultaneously presented to the CPU decoder and the APU controller. If the CPU recognizes the instruction, it is executed. If not, the APU controller or the user-created coprocessor has the opportunity to acknowledge and execute it. Optionally, one or two operands can be passed to the coprocessor, and a result or status can be returned. The APU interface can also transfer a data element ranging in size from one byte to four 32-bit words with a single instruction.
2. PowerPC, integrated APU controller, and coprocessor.
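To make the decode hand-off above concrete, the sketch below shows one common way application code issues such an instruction: an inline-assembly wrapper around a PowerPC 405 user-defined instruction (UDI). The udi0fcm mnemonic, and whether a given assembler accepts it directly, are toolchain-dependent; this is an illustrative pattern, not the article's exact method.

```c
/* Sketch: wrapping a user-defined APU instruction (UDI) for use from C.
 * The CPU does not recognize udi0fcm, so the APU controller passes the
 * two register operands to the fabric coprocessor, which returns the
 * result. Some tool flows encode the instruction word manually instead
 * of using the mnemonic. */
static inline unsigned int coproc_op(unsigned int a, unsigned int b)
{
    unsigned int result;

    __asm__ __volatile__ ("udi0fcm %0, %1, %2"
                          : "=r" (result)
                          : "r" (a), "r" (b));
    return result;
}
```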
One or more coprocessors can be attached to the APU interface through a fabric coprocessor bus (FCB). Coprocessors attached to the bus range from off-the-shelf cores, such as an FPU, to user-created coprocessors. A coprocessor can connect to the FCB for control and status operations and to a processor bus, enabling direct access to memory data blocks and DMA data passing. A simplified connection scheme, such as the FSL, can also be used between the FCB and coprocessor, enabling FIFO data and control communication at the cost of some performance.
To demonstrate the performance advantage of an instruction-pipeline-connected accelerator, we first implemented a design with a processor bus-connected FPU and then with an APU/FCB-connected FPU. Table 1 summarizes the performance for a finite impulse response (FIR) filter for each case.
Table 1. Non-accelerated vs. accelerated floating-point performance.
As the table shows, an instruction-pipeline-connected FPU accelerates floating-point operations by 30x over the software-only implementation, while the APU connection provides nearly a 4x improvement over a bus-connected FPU.
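For reference, the kind of inner loop measured in Table 1 looks something like the single-precision FIR kernel below. The tap count and coding style of the actual benchmark are not given here, so this is a representative sketch rather than the measured code.

```c
/* Representative single-precision FIR inner loop. With software floating
 * point, each multiply and add becomes an emulation library call; with a
 * bus- or APU-connected FPU, each becomes a hardware floating-point
 * instruction, which is where the speedups in Table 1 come from. */
#define NUM_TAPS 16   /* illustrative tap count; not from the benchmark */

float fir(const float *samples, const float *coeffs)
{
    float acc = 0.0f;

    for (int i = 0; i < NUM_TAPS; i++)
        acc += samples[i] * coeffs[i];

    return acc;
}
```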