Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
The Theory of Constraints (TOC) is a philosophy of continuous improvement, a framework applicable to many disciplines. The theory is based on the idea that in any complex system there is usually one aspect of the system that limits its ability to achieve its goal or to function optimally. To achieve any significant improvement of the system, that constraint must be identified and resolved.
Software optimization, in the generic sense, is the process of improving software by eliminating bottlenecks so that it operates more efficiently on given hardware and uses resources optimally. In keeping with Knuth's advice, TOC can help developers identify the bottlenecks whose elimination offers the greatest return on investment (ROI). Identifying the bottlenecks in the target application and eliminating them appropriately is the key to efficient optimization.
There are many optimization methodologies that help developers answer the questions of why to optimize, what to optimize, and how to optimize, and thereby meet their performance requirements. These questions are no different from those asked when improving an organization, a supply chain, or a manufacturing process.
In this article, I propose that TOC, a philosophy of management and improvement, can also be applied to software optimization. The focus of this article is how to identify bottlenecks and eliminate them with the help of the Intel VTune Performance Analyzer while following the framework described by the Theory of Constraints.
Introduction to Software Optimization: Hotspots and More
Performance tuning usually focuses on reducing the time it takes to complete a well-defined workload. Processor performance events can be used to measure this elapsed time, so reducing the elapsed time of a workload is equivalent to reducing the measured processor cycles (clockticks).
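Before reaching for event sampling, it helps to have a repeatable way to time the well-defined workload itself. The sketch below uses the standard C `clock()` call; `workload()` is a hypothetical stand-in for the code under test, not something from the article:

```c
#include <time.h>

/* Hypothetical workload standing in for the code under test.
 * The volatile sink keeps the compiler from optimizing it away. */
static volatile long long sink;
static void workload(void)
{
    long long sum = 0;
    for (long long i = 0; i < 1000000; i++)
        sum += i;
    sink = sum;
}

/* Measure CPU time spent in a workload, in seconds. */
static double measure_seconds(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A fixed harness like this gives the "before" number that any tuning effort is judged against; VTune's sampling then explains where those seconds go.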
The Intel VTune Performance Analyzer is a powerful tool for software developers interested in performance analysis. It helps the developer identify where most of the CPU cycles are spent, in addition to tracking many other processor events, through two profiling techniques: sampling and call graph. Sampling comes in two forms: processor event-based and time-based. Event-based sampling (EBS) relies on the performance monitoring unit (PMU) supported by the processor, and the easiest way to identify the hotspots in a given application is to sample it on processor cycles. From this point forward, I refer to event-based sampling (EBS) simply as "sampling."
When a sampling activity is used, VTune analyzer by default uses processor cycles and instructions retired to analyze the application. (Recent generations of Intel 64 and IA-32 processors feature microarchitectures using an out-of-order execution engine fed by an in-order front end; instructions are committed in program order, hence "instructions retired.") The count of cycles, also known as "clockticks," forms the fundamental basis for measuring how long a program takes to execute. The total cycle measurement is the start-to-finish count of cycles needed to complete the application of interest. In typical performance tuning situations, total cycles can be measured with the event CPU_CLK_UNHALTED.CORE.
The other event used by the default sampling activity is instructions retired. This event counts the instructions that retired, that is, executed to completion; it does not include partially processed instructions executed due to branch mispredictions. The ratio of clockticks (non-halted) to instructions retired is called clocks per instruction retired (CPI), and it can be a good indicator of performance problems, reflecting the efficiency of the instructions generated by the compiler and/or low CPU utilization. (The Intel Core microarchitecture is capable of reaching a CPI as low as 0.25 in ideal situations. A higher CPI for a given workload indicates that there are more opportunities for code tuning to improve performance.)
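The CPI metric is simply the ratio of the two default sampling events. A minimal sketch, assuming the two counts come from a VTune sampling run (the values in the comments are hypothetical):

```c
#include <stdint.h>

/* Clocks per instruction retired (CPI): non-halted clockticks
 * (CPU_CLK_UNHALTED.CORE) divided by instructions retired.
 * The event totals would come from a sampling run; any numbers
 * used here are illustrative, not real measurements. */
static double cpi(uint64_t clockticks, uint64_t instructions_retired)
{
    if (instructions_retired == 0)
        return 0.0; /* no retired instructions sampled */
    return (double)clockticks / (double)instructions_retired;
}
```

A result near 0.25 approaches the Core microarchitecture's ideal of four retired μops per cycle; a markedly higher value flags a tuning opportunity.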
The approach I use here is explained in greater detail in the Intel 64 and IA-32 Intel Architecture Optimization Reference Manual. In this approach, it is accurate to say that the total number of cycles an application takes is the sum of the cycles spent issuing μops and the cycles not issuing μops (stalls). This formula is given with the processor event names in Formula 2.
Total Cycles = Cycles issuing μops + Cycles not issuing μops
It is worth noting that this decomposition is approximate in nature, so it should be treated as an estimate.
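Given that decomposition, the share of cycles lost to stalls can be estimated from the two event totals. A small sketch, again assuming hypothetical sampling counts:

```c
#include <stdint.h>

/* Estimate the fraction of cycles spent stalled (not issuing μops),
 * per the decomposition:
 *   Total Cycles = Cycles issuing μops + Cycles not issuing μops.
 * The counts are hypothetical sampling totals. */
static double stall_fraction(uint64_t total_cycles,
                             uint64_t cycles_issuing_uops)
{
    if (total_cycles == 0 || cycles_issuing_uops > total_cycles)
        return 0.0; /* guard against empty or inconsistent samples */
    return (double)(total_cycles - cycles_issuing_uops)
           / (double)total_cycles;
}
```

Since the decomposition itself is approximate, this fraction should be read as a rough gauge of how much headroom stall elimination might offer, not as an exact figure.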
Therefore, my focus is on three basic concepts:
- Minimizing "stalls" by reducing or optimizing long latency operations such as memory accesses, floating-point operations, etc.,
- Minimizing "non-retired" instructions by reducing the branch mis-predictions,
- Minimizing "retired" instructions.
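To illustrate the second point, an unpredictable data-dependent branch can often be replaced with straight-line arithmetic. The example below (my own, not from the article) counts elements at or above a threshold both ways; on random input, the branchy loop mispredicts frequently and wastes cycles on instructions that never retire:

```c
#include <stddef.h>

/* Branchy version: on random data the if() mispredicts often,
 * flushing the pipeline and executing instructions that are
 * ultimately not retired. */
static long count_ge_branchy(const int *a, size_t n, int threshold)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= threshold)
            count++;
    }
    return count;
}

/* Branchless version: the comparison result (0 or 1) is added
 * directly, leaving no data-dependent branch to mispredict. */
static long count_ge_branchless(const int *a, size_t n, int threshold)
{
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += (a[i] >= threshold);
    return count;
}
```

Both functions return the same result; sampling on mispredicted-branch events would show whether the transformation actually pays off for a given data distribution.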
In the Intel Core microarchitecture (Figure 1), up to four μops may be retired per cycle and up to six μops can be executed. The results of Intel 64 and IA-32 instructions must be committed in original program order before the instructions are retired. Reducing the number of retired instructions will improve execution time, as will reducing stalls.
Cycles wasted due to stalls indicate sub-optimal execution; therefore, identifying them will be the main focus. Stalls can be categorized as:
- Execution stalls
- Front End stalls
- Mispredicted branch pipeline flushing
- Cycles wasted executing instructions that are not retired
With the introduction of Core architecture, the approach to identify and root-cause stall cycles became easier and more systematic. Before going into deeper analysis and explanation of the events, let's introduce the Theory of Constraints.
Applications Will Perform as Fast as Their Bottlenecks: Gentle Introduction to the Theory of Constraints
TOC is a philosophy of management and improvement developed by Eliyahu M. Goldratt. As mentioned earlier, it is a process of continuous improvement which focuses on the idea that in any complex system, there is usually one aspect of that system that limits its ability to achieve its goal.
The TOC processes defined by Goldratt are used to improve the health of an organization and these processes are very similar to the ones used by software optimization methodologies.
In this theory, five steps are defined. Before these steps are applied, a goal must be articulated. In our case, this is improving the performance of our application or meeting a performance requirement. Having articulated the goal, the steps can be given as:
- Identify the constraint (the root-cause that prevents achievement of the goal)
- Decide how to exploit the constraint (make sure the constraint identified is doing what it is designed to do, e.g., an execution unit is executing rather than stalling)
- Subordinate and synchronize everything else to the above decisions
- Elevate the performance of the constraint (increase the capacity of the constraint, e.g., by taking advantage of multiple cores and parallelism)
- If in any of the above steps the constraint has shifted, go back to Step 1.
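The five steps above form an iterative loop, and that loop can be sketched in code. In the toy model below the "system" is an array of per-component cycle costs, `identify_constraint` stands in for a VTune sampling run, and halving a cost stands in for real tuning work; all of this is illustrative assumption, not a real measurement pipeline:

```c
#include <stddef.h>

/* Step 1: identify the constraint, i.e. the costliest component. */
static size_t identify_constraint(const double *cost, size_t n)
{
    size_t worst = 0;
    for (size_t i = 1; i < n; i++)
        if (cost[i] > cost[worst])
            worst = i;
    return worst;
}

static double total_cost(const double *cost, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += cost[i];
    return total;
}

/* Steps 2-5 as a loop: exploit/elevate the constraint (simulated by
 * halving its cost), then re-measure; if the goal is still unmet the
 * constraint may have shifted, so we return to Step 1. Returns the
 * final total cost. A cap bounds the sketch against pathological
 * inputs. */
static double tune_until_goal(double *cost, size_t n, double goal)
{
    for (int iter = 0;
         iter < 1000 && n > 0 && total_cost(cost, n) > goal;
         iter++) {
        size_t c = identify_constraint(cost, n);
        cost[c] *= 0.5;
    }
    return total_cost(cost, n);
}
```

The key property the sketch captures is Step 5: after each improvement the worst component is re-identified, because eliminating one bottleneck routinely exposes another.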
The most critical piece in these steps is identifying the constraints. The questions to ask are:
- "What to change?",
- "What to change to?" and
- "How to cause the change?"
TOC defines so-called thinking processes to answer these questions and provides a problem-solving framework based on cause and effect. The basic idea, while trying to reach our goal (the desired performance), is to identify the hotspots, understand the underlying cause-and-effect relations, and eliminate the causes. Taking the four major stall categories given earlier as an example: poor performance (the effect) is most often caused by execution stalls and resource-related stalls, while the execution stalls themselves (now the effect) are in turn caused by sub-optimal algorithms.
In the rest of the article, I show how you can identify the constraints that prevent optimal application performance by using the TOC methodology as a guideline for software optimization on the Core architecture.