Managing Multi-Core Projects, Part 1: HPC on the Parallelism Frontier

HPC may be on the leading edge, but key advice like going parallel early, thinking strategically, and spreading knowledge throughout the team applies to all development managers. This is Part 1 of a 6-part series on managing multi-core development projects.


December 16, 2008
URL: http://drdobbs.com/parallel/managing-multi-core-projects-part-1-hpc/212500765


HPC was parallel when parallel wasn't cool. In high-performance computing, where developers have long experience with parallel computing and where large clusters are often the target platform, the multi-core driven concurrency revolution isn't catching anyone by surprise. From these pioneers, we can learn that parallelism makes a competitive difference -- and that it doesn't happen overnight.

In HPC, as in other software, all signs point to increasing parallelism as the surest path to improved system performance and competitive advantage. Just as you were warming up to programming SMP within the nodes of a cluster, asymmetry among processor cores is adding further complexity. But whatever form the next-generation system takes, whether an SMP (symmetric multiprocessing) system built on multi-core processors, an FPGA supercomputer, a hybrid GPU design, or another asymmetric configuration, it can be approached with some proven principles for managing parallel software projects.

Parallelism is a defining feature of HPC, as are large data sets and long run times, sometimes measured in days or weeks. Typical HPC applications divide the dataset among multiple processors, achieving parallelism through data decomposition. Data decomposition can be an effective technique in games and video applications as well.
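To make the idea concrete, here is a minimal sketch (mine, not the article's) of data decomposition in C with OpenMP, where a simple array sum stands in for the real per-element work:

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double data[N];
        double sum = 0.0;

        /* The dataset is divided among threads: each thread initializes
           and processes a disjoint slice, and the runtime combines the
           per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            data[i] = (double)i * 0.5;   /* stand-in for real work */
            sum += data[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }

The same slice-the-data pattern is what a game or video pipeline applies to frames or tiles.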

HPC platforms may be threaded, shared-memory systems, or they may rely on message passing for communication and coordination among large collections of more independent nodes. The two concurrency techniques may also be used together. Although not as performance-sensitive, multithreaded enterprise applications face similar architectural complexity when it comes to program correctness.

Threaded SMP systems enjoy a tremendous bandwidth and latency advantage over distributed memory systems, but scaling is limited. Multi-core processors raise the scaling limit of SMP, increasing its applicability to parallel programming problems, whether in HPC or in the enterprise.

Starting Points

Revisions in HPC applications are frequently driven by a user requirement for greater capacity (the ability to handle larger datasets). That may mean more threading to improve performance. In addition, look for ways to apply parallelism to add new capability (the ability to solve new problems with additional resources). The new prevalence of multi-core will make it easier to find third-party parallel components that help in this regard. Both added capacity and added capability are desirable, but while the former keeps you ahead of your competition, the latter can put you in a whole new market.

For development managers, the challenge is not so much to introduce parallelism as to plan development approaches that continually target scalability. That challenge has to be met not only at the start of a new project, but through successive upgrades.

If you're planning the next version of an application, introduce additional parallelism along with the other changes. One approach is to concentrate on the new modules that implement new features, ensuring these make the best use of parallelism. For this to be effective, you need to make sure that new code is sufficiently isolated from old code. That won't always be possible, but it's a further argument for walling off new features in separate modules. By keeping new functions modular, you can aggressively add new parallel code while limiting its impact on existing code and limiting the scope of required regression testing. Modularity is a desirable goal unto itself. Because the interfaces between systems are so sharply defined, message-passing systems tend to be more modular than threaded programs.
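One way to enforce that isolation is a deliberately narrow interface. The following C header is a hypothetical sketch (all names are made up) of a new parallel feature walled off from legacy code:

    /* new_feature.h -- the only surface the rest of the application sees.
       No thread types or synchronization primitives leak through it. */
    #ifndef NEW_FEATURE_H
    #define NEW_FEATURE_H

    #include <stddef.h>

    typedef struct feature_ctx feature_ctx;   /* opaque handle */

    feature_ctx *feature_create(size_t n_items);
    int          feature_run(feature_ctx *ctx, const double *in, double *out);
    void         feature_destroy(feature_ctx *ctx);

    #endif

Internally, feature_run can be as aggressively parallel as you like; to callers, and to the regression suite, it behaves like any other serial call.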

Amdahl's law is a well-known principle that describes the benefit you can expect from moving portions of a program from serial to parallel execution. The most direct approach to better performance is to follow where Amdahl's law leads and go after the serial regions in existing code. Attacking this problem requires a thorough performance analysis, which historically has meant reading through code, but automated tools can improve the process by increasing coverage. Intel Performance Analyzer's call-graph profiling can help here.
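In its usual form, Amdahl's law says that if a fraction p of the runtime can be parallelized across n processors, the overall speedup is 1 / ((1 - p) + p/n). This small C sketch (mine, for illustration) shows how quickly the returns diminish:

    #include <stdio.h>

    /* Amdahl's law: overall speedup for parallel fraction p on n processors. */
    static double amdahl(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / (double)n);
    }

    int main(void)
    {
        /* With 90% of the runtime parallel, 8 cores deliver only about
           4.7x, and no core count can ever exceed 1 / (1 - p) = 10x. */
        for (int n = 1; n <= 64; n *= 2)
            printf("p = 0.90, n = %2d  ->  speedup %5.2fx\n", n, amdahl(0.90, n));
        return 0;
    }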

Other tools can help to measure the performance of large cluster systems. For applications using MPI (Message Passing Interface), Intel Trace Collector and Intel Trace Analyzer can analyze performance on cluster systems of over 1,000 processors.
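For context, the message-passing style such tools instrument looks like this minimal MPI sketch (a toy reduction of my own, not taken from the article):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank computes a partial result on its own share of the data... */
        double partial = (double)rank + 1.0;   /* stand-in for real work */
        double total = 0.0;

        /* ...and the partials are combined by explicit message passing. */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total from %d ranks = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }

A trace analyzer's job is to show where the time goes in calls like MPI_Reduce as the rank count grows.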

There's no working around a bad design, but that doesn't mean that a good design can't be improved by some of the same tools and techniques that you might apply to legacy code. Testing modules for performance as well as correctness is important before introducing the complexity of a fully integrated system. Intel Thread Checker has unique capabilities for debugging threaded applications that are useful at this stage. For performance analysis, Intel Thread Profiler can compare threaded performance of several versions.
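The classic defect such correctness tools are built to catch is an unsynchronized update to shared state. This pthreads sketch (mine, for illustration) shows the race and its straightforward fix:

    #include <stdio.h>
    #include <pthread.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            /* Without the lock, counter++ is a data race (a read, an
               increment, and a write that two threads can interleave);
               with it, the update is safe. */
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* reliably 200000 with the lock */
        return 0;
    }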

Real-World Conditions

There's no way to cover every case with simple rules. What might be a pragmatic solution that keeps a project on schedule might, under different conditions, be a shortsighted fix that hampers long-term performance. Tactical judgment needs to be applied in a strategic context, and there's no substitute for experience in developing that judgment.

Bob Kuhn, Intel's Technical Marketing Director for Advanced Parallel Software Platforms, is a parallel-computing expert and a veteran of many a parallel programming development effort. Kuhn says that many HPC projects at first sought to increase performance by optimizing away the current bottleneck, using the easiest mechanism, then attacking the next bottleneck that cropped up in a similar fashion. "For pragmatists," says Kuhn, "that may provide sufficient performance."

But Kuhn cautions that such an approach has a point of diminishing returns—what he terms the "project manager's version of Amdahl's law." Eventually, the most egregious bottlenecks are eliminated, and each successive target of optimization delivers a lower marginal performance benefit for the same amount of development resources.

Kuhn describes a more sustainable approach. "Analyzing the goal with Amdahl's Law, start by saying everything in the application must eventually be in the parallel region to reach your goal," he says. "What data structures must be parallel and without synchronization?" Improvements along these lines may show smaller short-term speedups per developer-hour, but they offer greater prospects for long-term gain, with the benefit coming from data decomposition.
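Kuhn's question can be made concrete. Instead of funneling every thread through a locked shared accumulator, restructure the data so each thread owns a disjoint slice and a private output slot. This pthreads sketch (my illustration, again using a simple sum) needs no locks at all in the hot loop:

    #include <stdio.h>
    #include <pthread.h>

    #define N       1000000
    #define NTHREAD 4

    static double data[N];
    static double partial[NTHREAD];   /* one slot per thread: no sharing */

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        int chunk = N / NTHREAD;
        double sum = 0.0;

        /* Each thread reads only its own slice and writes only its own
           slot, so the loop runs with no synchronization whatsoever. */
        for (int i = id * chunk; i < (id + 1) * chunk; i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREAD];
        double total = 0.0;

        for (int i = 0; i < N; i++)
            data[i] = 1.0;

        for (long i = 0; i < NTHREAD; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREAD; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];   /* the only coordination is the join */
        }
        printf("total = %f\n", total);
        return 0;
    }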

Optimization often makes an application structurally more complex, making it harder to improve overall parallelization after optimizations have been made. "After many changes, you find you have to do much more to switch to data decomposition," says Kuhn.

On the other hand, according to Kuhn, sometimes you have to consider options other than data decomposition, even in HPC. For example, a workflow model might be a more practical first-pass way to quickly integrate third-party programs in your HPC application than a deep parallel integration. In this case, the clean stdin/stdout interface of a workflow approach avoids a number of bugs that would surely crop up in a shared-memory integration of two large, complex pieces of code from different sources.
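Such a workflow integration can be as simple as driving the third-party program through a pipe. In this POSIX C sketch, the solver name and its flags are hypothetical:

    #include <stdio.h>

    int main(void)
    {
        /* Run a hypothetical third-party solver as a separate process and
           read its results over stdout: no shared memory, no shared bugs. */
        FILE *p = popen("external_solver --input model.dat", "r");
        if (p == NULL) {
            perror("popen");
            return 1;
        }

        char line[256];
        while (fgets(line, sizeof line, p) != NULL)
            printf("solver says: %s", line);   /* real code would parse here */

        if (pclose(p) != 0)
            fprintf(stderr, "solver exited abnormally\n");
        return 0;
    }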

The Dream Team

Part of the planning phase should be an evaluation of the skills of project team members. It's important to have familiarity with parallel programming throughout the team, from those responsible for the initial design to those providing field support.

In the ideal case, a "dream team" would deliver applications that delight the user with new functionality, scale up to the number of cores on the newest processors, and ship on schedule with market availability. In building that dream team, you'll get better results teaching an engineer with domain knowledge the principles of parallelism than training a computer scientist in what your users expect. Work toward a team with the skills described below.

Of course, reality rarely reflects the ideal case, but the critical point here is that all team members need to have experience with parallel computing.

Grow three types of knowledge: knowledge of what your users want from each function, horizontal knowledge of the architecture and its clean synchronizing interfaces, and vertical knowledge of building and using thread-safe components. Don't have a developer who has added parallelism to one function move on to parallelizing another; each function's own developer knows the code best and will have the best intuition about what is and is not parallel.

Finally, whether you're developing an HPC application or productivity software, the fundamental things apply: Target some features where you can add parallelism today. Think strategically about parallelism in the whole application. And develop parallel skills in every member of the development team. In the next installment of this series, we'll move beyond the planning phase and explore the management issues surrounding implementation, test, and debug of parallel HPC systems.


Steve Apiki is senior developer at Appropriate Solutions, Inc., a Peterborough, NH consulting firm that builds server-based software solutions for a wide variety of platforms using an equally wide variety of tools. Steve has been writing about software and technology for over 15 years.
