Philip is the original author and technical lead for PAPI. He is a research consultant for the Innovative Computing Laboratory at the University of Tennessee, Knoxville. Nils and Per are computer scientists at the Center for Parallel Computers (PDC) at the Royal Institute of Technology in Stockholm, Sweden, and are regular contributors to the PAPI project. They can be contacted at [email protected], [email protected], and [email protected], respectively.
Contrary to widely held misconceptions, performance is portable. Virtually every optimization effort initially involves tuning for architectural features found on most nonvector processors. The code is modified to improve data reuse in the caches, reduce address misses in the TLB, and prevent register spills to memory, all of which can cause stalls in the processor's pipeline. But the questions often arise: Why should we optimize code at all? Why not just use the best available compiler options and move on? If the code runs too slowly, why not just upgrade our hardware and let Moore's Law do the work for us? There are many answers to these questions, the simplest being that the performance of most applications depends as much on the bandwidth of the memory subsystem as on the core clock speed of the processor. While processor clock rates continue to double every 18 months, DRAM sees at best a 10 percent improvement in bandwidth over the same period. Another reason to optimize has to do with the average performance of the user's hardware. Many applications, such as an MP3 player or video-conferencing package, have a fixed amount of work to do in a finite amount of time. Considering that most of the user community runs computers that are at least a generation or two behind, optimizing this software guarantees that more people will be able to run your code.
In the high-performance computing (HPC) community, optimization is of critical importance because it directly translates into better science. At HPC centers, computer time is "sold" to the various parties of each institution. Take, for example, the National Energy Research Scientific Computing Center (NERSC), the computer center at Lawrence Berkeley National Laboratory (http://www.lbl.gov/). The center runs a variety of large machines and sells time to each of the different divisions of the lab, as well as to other government and academic institutions around the country. On these systems, 99 percent of the workload is scientific simulation: virtual experiments run beyond physical or economic constraints. These applications range from the simulation of self-sustaining fusion reactors to climate models of the entire Earth down to a resolution of a few kilometers. The compute time of these models always depends on the resolution of the simulation. To fit within budget, experimenters must choose a resolution that resolves the phenomena of interest, yet still remains within their allocation. Optimizing these applications can yield tremendous savings of compute time, especially when you consider that these simulations are parallel codes; there may be thousands of processors executing similar code. A 30 percent decrease in runtime could allow for a 30 percent increase in the resolution of the simulation, possibly exposing subtleties not present in the coarser run. Such a difference directly translates into accelerating the pace of scientific discovery, and can even (as in the case of car-crash simulations or airplane-wing design) save lives.
However, the hardest part of the optimization process is understanding how the system architecture, operating system, compiler, and runtime system interact, and how that interaction affects the application's performance. All of these elements must be taken into consideration when attempting to optimize an application. Wouldn't it be nice if there were a way to reduce the prerequisite knowledge and experience by having the processor tell you exactly where and why your code was losing performance? It turns out that such a method does exist: the on-chip performance monitoring (PM) hardware found on almost every microprocessor in use today. The PM hardware consists of a small number of registers with connections to various other parts of the chip. Traditionally, this hardware was used exclusively for testing and verification. Now, however, its importance has become widely understood, especially within the HPC community.
There are two methods of using the PM hardware: aggregate (direct) and statistical (indirect):
- Aggregate usage involves reading the counters before and after the execution of a region of code and recording the difference. This usage model permits explicit, highly accurate, fine-grained measurements. There are two subcases of aggregate counter usage: Summation of the data from multiple executions of an instrumented location, and trace generation, where the counter values are recorded for every execution of the instrumentation.
- The second method is statistical profiling: The PM hardware is set to generate an interrupt when a performance counter reaches a preset value. This interrupt carries with it important contextual information about the state of the processor at the time of the event; specifically, it includes the program counter (PC), the text address at which the interrupt occurred. By populating a histogram with this data, users obtain a probabilistic distribution of PM interrupt events across the address space of the application. This kind of profiling facilitates a good high-level understanding of where and why bottlenecks occur. For instance, the questions "What code is responsible for most of the cache misses?" and "Where is the branch-prediction hardware performing poorly?" can quickly be answered by generating a statistical profile. (A minimal sketch of this mode appears after this list.)
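To make the statistical mode concrete, here is a minimal sketch using PAPI's low-level interface (covered in more detail later), assuming PAPI 3. The threshold and the trivial handler are illustrative only; a real profiler would increment a histogram bucket keyed by the interrupted address rather than print it:

#include <stdio.h>
#include "papi.h"

/* Called by PAPI each time the counter crosses the threshold.
   'address' is the program counter at the time of the interrupt. */
void profile_handler(int EventSet, void *address,
                     long_long overflow_vector, void *context)
{
    printf("PM interrupt at PC %p\n", address);  /* histogram[address]++ in real use */
}

int main()
{
    int EventSet = PAPI_NULL;
    long_long values[1];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&EventSet);
    PAPI_add_event(EventSet, PAPI_L1_DCM);
    /* Interrupt after every 100000 L1 data cache misses. */
    PAPI_overflow(EventSet, PAPI_L1_DCM, 100000, 0, profile_handler);
    PAPI_start(EventSet);
    /* ... run the code being profiled ... */
    PAPI_stop(EventSet, values);
    return 0;
}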
In "Optimization Techniques" (DDJ, May 2004), Tim Kientzle describes how to use the real-time cycle counter (rdtsc) found on the x86 architecture. However, there are some problems with this technique. The first is due to the nature of a system designed for multiprocessing. No system can be considered quiet when examined from a sufficiently fine granularity. For example, the laptop on which this article is being written is busy servicing interrupts from a variety of sources, as well as delivering commands to the graphics coprocessor and the PCM audio chip. The consequence of this is that it is very likely that the cycle count from rdtsc() includes a host of other unrelated events. On a desktop system, a stock Redhat 9 system with no "active" processes, the system (vmstat) reports about 120 interrupts and 50 context switches every 10 seconds. Doing cycle-by-cycle accounting of code is thus impossiblejust one interrupt, or one context switch during a measurement interval, could mislead you in your quest for bottlenecks.
The next problem with timers is more severe: Timers don't tell you anything about why the hardware is behaving the way it is. The only way to understand the system with timers alone is to attempt to reconstruct the processor's pipeline as it executes your code, a virtually impossible task on today's advanced CPUs, even for experts with detailed knowledge of the processor's microarchitecture. The last problem is one of portability and interface semantics. While the rdtsc() code segment works nicely in the Intel/Microsoft environment, it doesn't help when you migrate to another architecture, operating system, or even compiler!
These problems and more are addressed by the Performance Application Programming Interface (PAPI) library. PAPI (available at http://icl.cs.utk.edu/papi/) is intended to be completely portable from an API standpoint; that is, performance tools and instrumentation that work on one platform work seamlessly, with a recompile, on another. While the interface is portable, the generated data is not. Data from the PM hardware rarely has the same semantics from vendor to vendor, and often changes from model to model. A cache miss on one platform may be measured entirely differently on another; even definitions as "simple" as an instruction can become blurred on modern architectures (consider x86 instructions versus x86 uops).
PAPI is implemented on a wide variety of architectures and operating systems. The current release, PAPI 3.0.7, is supported on the Cray T3E and X1; Sun UltraSparc/Solaris; Alpha/Tru64; IBM POWER 604e, 2, 3, and 4/AIX; MIPS R10K/IRIX; IA64, IA32, and x86_64/Linux; and Windows/IA32 (excluding the Pentium 4). Wrappers for PAPI exist for C, C++, Fortran, Java, and Matlab.
To facilitate the development of portable performance tools, PAPI provides interfaces for querying the execution environment, as well as methods to obtain a complete listing of the PM events available for monitoring. PAPI supports two types of events, preset and native. Preset events have a symbolic name associated with them that is the same for every processor supported by PAPI. Native events, on the other hand, provide a means to access every possible event on a particular platform, regardless of whether a predefined PAPI event name exists for it.
For preset events, users can issue a query to PAPI to find out if the event is present on the current platform. While the exact semantics of the event might be different from processor to processor, the name for the event is the same (for example, PAPI_TOT_CYC for elapsed CPU cycles). Native events are specific to each processor and have their own symbolic name (usually taken from the vendor's architecture manual or header file, should one exist). By their nature, native events are always present.
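A minimal sketch of both cases, assuming PAPI 3 (the native event name below is an Opteron example; names vary by platform):

#include <stdio.h>
#include "papi.h"

int main()
{
    int native;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    /* Preset: same name everywhere, but may not exist everywhere. */
    if (PAPI_query_event(PAPI_L1_DCM) == PAPI_OK)
        printf("PAPI_L1_DCM is supported on this platform\n");
    /* Native: platform-specific name, resolved to an event code. */
    if (PAPI_event_name_to_code("FP_MULT_AND_ADD_PIPE", &native) == PAPI_OK)
        printf("Native event code: 0x%x\n", native);
    return 0;
}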
PAPI supports per-thread measurements; that is, each measurement contains only counts generated by the thread performing the PAPI calls. Each thread has its own PM context and thus can use PAPI completely independently of other threads. This is achieved through the operating system, which saves and restores the performance counter registers at each context switch, just like the rest of the processor state. A lazy save-and-restore scheme keeps the additional overhead to a few dozen cycles for the processes and threads that actually use the PM hardware. Most operating systems (AIX, IRIX, Tru64, Unicos, Linux/IA64, HPUX, Solaris) have officially had this support for some time, motivated largely by the popularity of the PAPI library and associated tools. For Linux/IA32/x86_64, support exists in the form of the PerfCtr kernel patch (http://user.it.uu.se/~mikpe/linux/perfctr/), written by Mikael Pettersson of Uppsala University in Sweden. This patch has not yet been formally accepted into the mainline Linux 2.6 kernel tree. Users with concerns about stability and support should know that the patch has been installed and in use for years on a majority of the U.S. Government's Linux clusters on the Top 500 list, not to mention numerous other computationally intensive sites around the world. The notable exception in operating-system support is Microsoft Windows. Because Windows does not preserve the state of the performance counters across context switches, counting on Windows must be done system-wide, greatly reducing the usefulness of the PM hardware for application development and tuning.
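Thread support must be enabled once, before any threads call PAPI. Here is a minimal sketch with POSIX threads, assuming PAPI 3; the wrapper function name is ours:

#include <pthread.h>
#include "papi.h"

/* Register a function that returns a unique id per thread;
   pthread_self is the usual choice. After this, each thread that
   touches PAPI gets its own counter context automatically. */
void setup_papi_threads(void)
{
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_thread_init((unsigned long (*)(void)) pthread_self);
}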
PAPI has two different library interfaces. The first is the high-level interface, meant for direct use by application engineers. It consists of eight functions that make it easy to get started with PAPI. It provides start/read/stop functionality, as well as quick and painless ways to get information such as millions of floating-point operations per second (MFLOPS) and instructions per cycle. Listing One contains code that demonstrates a canonical performance problem: traversing memory with nonunit stride. We measure this code's performance using the PAPI high-level interface. The example uses PAPI presets; they are portable in name but might not be implemented on all platforms, and may in fact mean slightly different things on different platforms. Hardware restrictions also limit which events can be measured simultaneously. This simple example verifies that the hardware supports at least two simultaneous counters; it then starts the counters counting the preset events for L1 data cache misses (PAPI_L1_DCM) and the number of floating-point operations executed (PAPI_FP_OPS). Compiling and running this example on an Opteron (64-byte line size) with gcc 3.3.3 (-O -g) results in this output:
Total software flops = 41943040.000000
Total hardware flops = 42001076.000000
MFlop/s = 48.258228
L1 data cache misses is 12640205
There's almost one cache miss for every four floating-point operations; performance is bound to be poor. Interchanging the for loops so that memory is accessed sequentially should give much better cache behavior.
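The fix is a simple loop interchange, so that the innermost loop walks each malloc'd row contiguously:

for (iter = 0; iter < NITER; iter++) {
    for (i = 0; i < MX; i++) {
        for (j = 0; j < MX; j++) {    /* j now varies fastest: unit stride */
            ad[i][j] += ad[i][j] * 3.0;
        }
    }
}

Rerunning the instrumented program gives: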
Total software flops = 41943040.000000
Total hardware flops = 42027880.000000
MFlop/s = 234.481339
L1 data cache misses is 2799387
Performance has more than quadrupled, and the cache misses have been reduced by nearly 80 percent. Note, however, that the number of hardware flops differs from the software count, and from run to run. Why? Running PAPI's avail utility (Figure 4; output edited for brevity) helps answer this question: The PAPI_FP_OPS preset is implemented here as the vendor's FP_MULT_AND_ADD_PIPE metric. This metric is speculative, meaning the hardware counter includes instructions that were not retired.
The low-level interface is designed for power users and tool designers. It consists of 53 functions that range from providing information about the processor and executable to advanced features like counter multiplexing, callbacks on counter overflow, and advanced statistical profiling modes. Counter multiplexing is useful in situations where the PM hardware has a very limited number of registers. That limitation can be overcome by trading measurement accuracy and granularity for an increase in the number of events measured: By rapidly switching the contents of the PM's control registers during the course of a run, PAPI gives users the appearance of many more PM registers than physically exist. For more on advanced features, see the accompanying text box entitled "Advanced PM Features."
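As a flavor of the low-level interface, here is a minimal sketch, assuming PAPI 3, that multiplexes three events onto hardware that may have only two physical counters:

#include <stdio.h>
#include "papi.h"

int main()
{
    int EventSet = PAPI_NULL;
    long_long values[3];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_multiplex_init();          /* enable multiplexing support */
    PAPI_create_eventset(&EventSet);
    PAPI_set_multiplex(EventSet);   /* this event set may time-share counters */
    PAPI_add_event(EventSet, PAPI_TOT_CYC);
    PAPI_add_event(EventSet, PAPI_TOT_INS);
    PAPI_add_event(EventSet, PAPI_L1_DCM);
    PAPI_start(EventSet);
    /* ... code under test ... */
    PAPI_stop(EventSet, values);
    printf("cycles=%lld instructions=%lld L1 misses=%lld\n",
           values[0], values[1], values[2]);
    return 0;
}

The counts for a multiplexed event set are estimates extrapolated from time slices, which is precisely the accuracy trade-off described above.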
While PAPI provides the necessary infrastructure, the true power of PAPI lies in the tools that use it. There are a number of excellent open-source performance analysis tools available that have been implemented using PAPI. While a complete discussion of all these tools is beyond the scope of this article, we mention a few in the accompanying text box entitled "Open-Source Performance Analysis Tools." For information about these and other tools, please refer to the PAPI web page.
PAPI's goal is to expose real hardware performance information to users. By doing so, most of the guesswork regarding the root cause of a code's performance problem can be eliminated. PAPI does not solve algorithmic design issues or diagnose inefficient parallelization. It can, however, diagnose poor usage of the available processor resources, a problem that, before now, was largely intractable. As of now, PAPI is an ad hoc standard, taking into account the input of the tool development and performance engineering community. In the future, we hope to see PAPI evolve into a true open standard and to see the vendors ship native versions of PAPI with each new release of processor and operating system. In the meantime, with continued cooperation from industry, academia, and the research community, PAPI will continue to evolve and further drive the science of performance engineering.
DDJ
Listing One

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include "papi.h"

#define MX 1024
#define NITER 20
#define MEGA 1000000
#define TOT_FLOPS (2*MX*MX*NITER)

double *ad[MX];

/* Get the process's virtual (CPU) time in seconds */
float gettime() {
    return ((float)PAPI_get_virt_usec() / 1000000.0);
}

int main()
{
    float t0, t1;
    int iter, i, j;
    int events[2] = { PAPI_L1_DCM, PAPI_FP_OPS }, ret;
    long_long values[2];

    if (PAPI_num_counters() < 2) {
        fprintf(stderr, "No hardware counters here, or PAPI not supported.\n");
        exit(1);
    }
    for (i = 0; i < MX; i++) {
        if ((ad[i] = malloc(sizeof(double)*MX)) == NULL) {
            fprintf(stderr, "malloc failed\n");
            exit(1);
        }
    }
    for (j = 0; j < MX; j++) {
        for (i = 0; i < MX; i++) {
            ad[i][j] = 1.0/3.0;   /* Initialize the data */
        }
    }
    t0 = gettime();
    if ((ret = PAPI_start_counters(events, 2)) != PAPI_OK) {
        fprintf(stderr, "PAPI failed to start counters: %s\n", PAPI_strerror(ret));
        exit(1);
    }
    for (iter = 0; iter < NITER; iter++) {
        for (j = 0; j < MX; j++) {
            for (i = 0; i < MX; i++) {
                ad[i][j] += ad[i][j] * 3.0;   /* i varies fastest: nonunit stride */
            }
        }
    }
    if ((ret = PAPI_read_counters(values, 2)) != PAPI_OK) {
        fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret));
        exit(1);
    }
    t1 = gettime();
    printf("Total software flops = %f\n", (float)TOT_FLOPS);
    printf("Total hardware flops = %f\n", (float)values[1]);
    printf("MFlop/s = %f\n", (float)(TOT_FLOPS/MEGA)/(t1-t0));
    printf("L1 data cache misses is %lld\n", values[0]);
    return 0;
}