Parallelism affects everything about the way you manage a development project, from the skills you need to develop in your team to the debugging tools you'll use in maintenance. In part one of this series, we focused on the up-front challenges, including staffing, planning, and design. Here we'll get into the rest of the process, looking at how parallel programming changes implementation, test, and debug. As in part one, the nominal topic here is high-performance computing. But many of the lessons learned in HPC, where parallel development experience is deepest, are fundamental: lessons like using thread-safe libraries as a foundation for concurrency, or continually improving scalability over time. These apply as well to software fields where multi-core is just starting to raise the issue of effective management of parallel projects.
Multi-core changes the competitive landscape by making threading a critical source of performance advantage. In HPC, parallelism may be built in at two levels: at a low level, with threads in a shared-memory system, or at a higher level, with message passing. Shared-memory systems can finesse the bandwidth restrictions and added latency of message passing, but they scale only up to the number of processors you can put in a single node. By putting a multiplier before that number, multi-core processors make threading an important source of additional parallelism and additional performance. This is true in HPC and also in enterprise software, where more threading can be applied to improving single-transaction performance in a server farm.
Building on Libraries
Delivering fast, reliable threaded software takes a good design and a team with the right skills, but it also takes starting the implementation in the right place. Pre-tested threading libraries such as OpenMP and Intel's Threading Building Blocks (TBB) can shorten both development time and time spent in debug. These libraries provide a higher-level, high-performance approach to threading and remove some of the mechanical, error-prone work that comes with the territory of SMP.
OpenMP is a threading API and a set of compiler pragmas for C, C++, and Fortran development on shared-memory systems. Intel's recently introduced TBB is a C++ template library optimized for Intel processors that does not require specific compiler support. Both reduce the burden of explicit data partitioning and explicit synchronization compared to working with native Windows or POSIX threads.
It's good practice to work with foundational libraries like these: OpenMP and Intel TBB for shared-memory systems, and MPI libraries like Intel MPI for message-passing systems. As we'll see in some industry examples below, MPI libraries can play a critical role in overall system performance.
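As a rough illustration of how a pragma takes over partitioning and synchronization, here is a minimal sketch of an OpenMP reduction in C++. The function name `sum_of_squares` is hypothetical and chosen only for this example. If the code is built without OpenMP support enabled, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the sum of the squares of the input values.
// The pragma asks OpenMP to divide the loop iterations among threads and to
// combine each thread's private partial sum when the loop ends (a reduction),
// so no explicit partitioning or locking appears in the code.
long long sum_of_squares(const std::vector<int>& values) {
    long long total = 0;
    const long long n = static_cast<long long>(values.size());
    #pragma omp parallel for reduction(+:total)
    for (long long i = 0; i < n; ++i) {
        const long long v = values[static_cast<std::size_t>(i)];
        total += v * v;
    }
    return total;
}
```

Written with native threads, the same loop would need explicit code to split the index range, create and join threads, and merge the partial sums.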
You might be able to further reduce development costs by incorporating additional libraries of common parallel algorithms. The growth of multi-core, and therefore of parallel programming, is beginning to make these kinds of components more plentiful. It's worth researching third-party libraries that can get you off to a faster start, or help you replace portions of homegrown code that may be under-performing or hard to maintain.
Some of the functionality you can get off the shelf includes numeric, statistical, and other math-oriented routines, and media-related functions like video coding and signal processing. The NAG libraries (from NAG) and IMSL libraries (from Visual Numerics) are broad functional libraries, both available in thread-safe versions. NAG offers a version that is itself internally threaded. Intel partitions its functional libraries into two packages, Intel Math Kernel Library (MKL) and Intel Integrated Performance Primitives (IPP). Intel's libraries are also thread-safe and internally threaded (using OpenMP).
You can, and should, verify the thread-safety of third-party libraries through your own testing. Even thread-safe libraries that are internally threaded can be the source of thread conflicts with the calling program if they aren't used correctly. Intel Thread Checker is an excellent tool for testing the thread safety of libraries and for finding threading problems in the system as a whole.
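One simple form such a test can take is sketched below: call a routine from several threads at once and compare every result against a serial reference run. This is a minimal harness under stated assumptions, not a substitute for a dedicated tool. The name `count_mismatches` is hypothetical, and `fn` stands in for whichever library routine is under test.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Calls `fn` on every input from `num_threads` threads simultaneously and
// returns how many concurrent results differ from a serial reference run.
// `fn` is a stand-in for the third-party routine being checked.
int count_mismatches(const std::function<double(double)>& fn,
                     const std::vector<double>& inputs,
                     int num_threads) {
    assert(num_threads > 0);

    // Serial reference results, computed before any threads start.
    std::vector<double> expected;
    for (double x : inputs) expected.push_back(fn(x));

    // Each thread writes only to its own slot, so the harness itself
    // adds no shared mutable state.
    std::vector<int> mismatches(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = 0; i < inputs.size(); ++i) {
                if (fn(inputs[i]) != expected[i]) ++mismatches[t];
            }
        });
    }
    for (auto& w : workers) w.join();

    int total = 0;
    for (int m : mismatches) total += m;
    return total;
}
```

A count of zero does not prove a routine is thread-safe, since a race may simply not have occurred on that run. A nonzero count, though, is a cheap early warning before turning to a tool that analyzes the threads directly.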
Test and Support
There's more to test in a shared-memory design than in a message-passing system, if for no other reason than that the greater coupling of SMP leads to a proliferation of test cases. Tests of highly threaded applications should vary thread execution order and timing as well as other conditions. The right tools are critical for effective test and debug of parallel systems. Again, Intel Thread Checker may be used to isolate problems that are uncovered in testing.
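A test harness can approximate this variation by sweeping the thread count and perturbing start times. The sketch below is illustrative only: the function names are hypothetical, and the shared counter stands in for whatever shared state the real application protects. Seeded random pauses change how the threads interleave from one trial to the next.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <random>
#include <thread>
#include <vector>

// Runs `num_threads` workers that each add `increments` to a shared counter.
// Every worker first sleeps for a small pause drawn from a seeded generator,
// so different seeds produce different interleavings. Returns the final count.
long run_counter_trial(int num_threads, int increments, unsigned seed) {
    assert(num_threads > 0 && increments >= 0);
    std::atomic<long> counter{0};
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> delay_us(0, 200);

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        const int pause = delay_us(rng);  // drawn on the main thread
        workers.emplace_back([&counter, increments, pause] {
            std::this_thread::sleep_for(std::chrono::microseconds(pause));
            for (int i = 0; i < increments; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}

// Sweeps thread counts from 1 to `max_threads` and several seeds per count.
// Returns true only if every trial produced the expected total.
bool sweep_trials(int max_threads, int seeds_per_count, int increments) {
    for (int n = 1; n <= max_threads; ++n) {
        for (int s = 0; s < seeds_per_count; ++s) {
            const long expected = static_cast<long>(n) * increments;
            if (run_counter_trial(n, increments, static_cast<unsigned>(s)) != expected) {
                return false;
            }
        }
    }
    return true;
}
```

Recording the seed and thread count of any failing trial makes it easier to reproduce the conditions under a threading debug tool.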
In consumer software, testing can't cover the variety of customer hardware configurations, but it can cover the number of cores in a customer system. In HPC, there may be fewer configurations to test, but it's possible that a customer will deploy on more cores than can be tested on the development cluster. Scaling limitations or bugs related to interactions between a large number of nodes may only surface at a customer site.
In these cases, final testing will necessarily occur with the customer, at deployment time. This scenario makes it especially important to have field support engineers with parallel testing and debugging skills, and with experience using parallel tools.
In the Field: Examples from the Oil and Gas Industry
Let's walk through some real examples. We'll use two case studies from the oil and gas industry to illustrate some of the points we've been discussing, both in this article and in part one of this series. Each is a development effort involving Intel engineers working with industry customers.
High-performance clusters are a critical part of the array of technologies deployed in the oil and gas industry to locate oil and gas deposits. There are a variety of scientific applications that run on these clusters, using message passing, threading, or a combination of these mechanisms.
In our first example case, a seismic company sought to speed up a long-running seismic imaging application that consumed 40 percent of total compute cycles in the data center. The application ran on a cluster of several hundred nodes, using MPICH, an open-source MPI library. The company viewed even a small gain as worth a significant amount of development time, for two reasons. First, reducing the load on the cluster would speed up other software and thus the overall process. Second, since the seismic imaging application run and data interpretation could take up to a month, even a small percentage gain could mean a day or two of real time.
The company's initial plan was to replace the MPICH 2.x library with Intel MPI. This alone led to a 2x speedup, but during the phase-in of Intel MPI, engineers discovered that they could further boost performance with some additional steps. These were a faster sparse-matrix multiply routine, which made for better single-node performance, and an added data partitioning stage, which reduced bandwidth consumption. Finally, Intel engineers were also able to improve global data communication performance, an improvement that was later rolled back into Intel MPI itself.
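The source does not describe how the added partitioning stage divided the data. As one common approach, and purely as an assumption for illustration, the sketch below splits a range of items into contiguous blocks so that each process works on its own slice. The name `block_range` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Returns the half-open range [begin, end) of items owned by `rank` when
// `total` items are split into `parts` contiguous blocks whose sizes differ
// by at most one item.
std::pair<std::size_t, std::size_t> block_range(std::size_t total,
                                                std::size_t parts,
                                                std::size_t rank) {
    assert(parts > 0 && rank < parts);
    const std::size_t base = total / parts;   // minimum block size
    const std::size_t extra = total % parts;  // first `extra` blocks get one more
    const std::size_t begin = rank * base + (rank < extra ? rank : extra);
    const std::size_t size = base + (rank < extra ? 1 : 0);
    return {begin, begin + size};
}
```

For example, 10 items split three ways gives the ranges [0, 4), [4, 7), and [7, 10). Keeping each block contiguous means a process can usually be sent its data in one transfer rather than many small ones.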
Each of these changes was built into a library. The company then integrated the libraries into its application, making for a relatively low impact on existing code. While Intel engineers were making changes, the company's non-parallel code continued to evolve, so there was additional by-hand merging of the two code bases that needed to be done before the project was completed.
Table 1. Steps taken in each of our example projects, broken down by phase of development.

| Phase | Seismic Imaging Speedup | Multi-User Visualization |
|---|---|---|
| Design | (1) Use faster MPI library; (2) Add data partitioning step; (3) Improve single-node performance; (4) Improve global data communication. | Thread each node in MPI cluster to support multiple clients. |
| Implement | Build changes as libraries, then integrate. | No thread-safe MPI libraries; use asynchronous messaging. |
| Test | Realize 4.5-5x speedup. | Tools (Intel Thread Checker) help compensate for less threading experience. |
| Tune | Reduce number of libraries by rolling performance changes into MPI library. | Bottleneck-by-bottleneck, using Intel VTune. |

The company saw an overall improvement of 4.5-5x on the seismic imaging application over the two-year span of the project. The project involved the geophysicist in charge of the application, two engineers from the seismic company, and two Intel engineers, one a computational engineer and one a library expert.
The second case wasn't directly driven by a demand for greater performance. Instead, the company added the capability to support several users on a highly-parallel visualization system that initially served a single visualization client. The system allows geophysicists to interactively explore segments of the earth's crust, running 550 MB of geological survey data through a 64-node cluster. At the start, the single visualization client communicated through a single master node on the cluster.
That arrangement underutilized the cluster. The project required threading each node in the cluster to support multiple visualization clients, and interleaving visualization tasks with computation tasks.
Combining threads and MPI proved to be a problem as the project progressed to implementation, because at the time there were no thread-safe MPI libraries. (There are now two options, Critical Soft WMPI and MPICH.) The company settled on asynchronous messaging using an MPI_Iprobe polling loop.
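The company's code is not shown in the source, so the following is only a sketch of the general polling pattern. To keep it self-contained it does not use MPI at all. A hypothetical in-process `Mailbox` stands in for the message layer, and its `try_receive` plays the role that a non-blocking probe such as `MPI_Iprobe`, followed by a receive, would play in an MPI program. All names here are assumptions.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical in-process stand-in for the message layer. try_receive returns
// immediately whether or not a message is waiting, like a non-blocking probe.
class Mailbox {
public:
    void post(int request) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(request);
    }
    bool try_receive(int& request) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.empty()) return false;
        request = pending_.front();
        pending_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::queue<int> pending_;
};

struct LoopResult {
    int compute_steps_done;
    int requests_served;
};

// Interleaves slices of computation with polling: after each compute step the
// loop drains any waiting requests, and it keeps polling until the expected
// number of requests has been served.
LoopResult run_polling_loop(Mailbox& mailbox, int compute_steps,
                            int expected_requests) {
    assert(compute_steps >= 0 && expected_requests >= 0);
    LoopResult result{0, 0};
    int request = 0;
    while (result.compute_steps_done < compute_steps ||
           result.requests_served < expected_requests) {
        if (result.compute_steps_done < compute_steps) {
            ++result.compute_steps_done;  // one slice of computation
        }
        while (mailbox.try_receive(request)) {
            ++result.requests_served;     // handle one client request
        }
        std::this_thread::yield();
    }
    return result;
}

// Demo: a client thread posts requests while the loop computes and polls.
LoopResult demo_polling(int compute_steps, int num_requests) {
    Mailbox mailbox;
    std::thread client([&mailbox, num_requests] {
        for (int i = 0; i < num_requests; ++i) mailbox.post(i);
    });
    LoopResult result = run_polling_loop(mailbox, compute_steps, num_requests);
    client.join();
    return result;
}
```

The design trade-off is that polling never blocks the computation, but the size of each compute slice sets how long a request may wait before it is noticed.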
The company made heavy use of Intel Thread Checker during the debugging phase. Having a threading debug tool was important at this point because the development team was much more familiar with MPI than with multi-threaded development. Nevertheless, they were able to use Thread Checker to its fullest, even bumping up against a limitation of the tool itself in working with memory-mapped files.
The team tuned the software using a classic bottleneck-by-bottleneck approach. They made considerable use of Intel VTune in this phase.
There were no performance metrics on this project other than the goal of supporting multiple users without any individual user seeing performance degradation. The project team of nine (part-time, the equivalent of two engineers full time), plus one or two Intel engineers, completed the work to support multiple users in about six months.
Each of these projects saw some significant benefit in terms of the "four Cs" of parallel programming projects: capacity, capability, creativity, and cost. In the seismic imaging case, better performance meant improved capacity and reduced cost, both in run times and in reduced impact on the data center. In the visualization project, threading nodes to support multiple clients meant an added multi-user capability, greater opportunity for end-user creativity, and reduced cost through better cluster utilization.
HPC projects like these are where all the parallel development action has been, but multi-core is changing all that, of course. In part three, we'll start looking into managing multi-core development projects in enterprise software.