Intel officially released the source code of Threading Building Blocks (TBB) during OSCON 2007. Federico Biancuzzi interviewed the Chief Evangelist for Intel's Software Development Products to learn more about the project.
Could you introduce yourself?
James Reinders: I'm Intel's Chief Evangelist for Intel's Software Development Products, and Director of Sales and Marketing (for Intel Software Development Products). Another way to look at it is I'm an engineer who joined Intel in 1989 'cause I thought it would be a cool place to work for a few years. I joined working on parallel supercomputers, and now I get to work on multi-core which brings parallelism to everyone. Not a bad deal.
What are Intel Threading Building Blocks?
James Reinders: Extensions for C/C++ for parallel programming. Most important, these extensions offer an abstraction which removes the programmer from thread management, and that is very important. These extensions work with all compilers, because TBB is implemented as a template library.
Why did you choose to use templates?
James Reinders: We wanted to work with every C++ compiler immediately. C++ templates offer the perfect method to do this: immediate and still very efficient, due to the strong support for generic programming that C++ offers. Extensions, such as OpenMP, take years to be universally available. In the case of OpenMP, it was about a decade from when we first implemented it before it was available from all the popular compilers. We didn't want to wait ten years, and we didn't see a need to wait.
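To make the template-library point concrete, here is a minimal sketch in standard C++ of how a function template can hide thread management behind a loop-like call. The name `parallel_for_sketch` is hypothetical and is not a TBB API; the even split of the index range across threads is an assumption made for brevity and is not how TBB schedules work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not part of TBB: calls body(i) for every i in
// [first, last). The caller supplies only the loop body; thread creation
// and joining stay inside the template.
template <typename Body>
void parallel_for_sketch(std::size_t first, std::size_t last, Body body) {
    assert(first <= last);
    const std::size_t n = last - first;
    if (n == 0) return;

    // hardware_concurrency() may return 0 when the count is unknown.
    const std::size_t hw =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t workers = std::min(hw, n);
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = first + w * chunk;
        const std::size_t end = std::min(last, begin + chunk);
        if (begin >= end) break;  // nothing left to hand out
        pool.emplace_back([begin, end, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (std::thread& t : pool) t.join();
}
```

A call such as `parallel_for_sketch(0, v.size(), [&v](std::size_t i) { v[i] *= 2; });` needs nothing beyond a standard C++11 compiler, which is the portability argument made above. The body must be safe to run concurrently for different indices.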
What type of projects would get the biggest advantages using TBB? I am thinking about existing codebase, where maybe they are already using native functions to manage concurrency, or maybe they are not taking advantage of multi-core at all.
James Reinders: C or C++ programs with parallelism which is not just simple loop-oriented data-parallel parallelism. If you have loop-oriented data-parallel parts in a program (C or Fortran), use OpenMP if possible. Otherwise, when you have a little more complex program in C or C++, you'll find TBB offers a great deal of flexibility that makes many forms of parallelism easy to represent. OpenMP cannot offer that.
How much time can developers save using TBB?
James Reinders: They save time four ways: implementation, debugging, tuning and updating for the future. What we actually see is TBB makes the difference between the problem being approachable, and not getting done at all because it is not approachable.
That said, I've seen accomplishments by programmers new to parallelism in a day which I doubt the same programmer would have gotten done in two weeks if they had studied hard and worked each day. And that's just considering the work to write, debug and tune the application to be equivalent. It ignores the "future proofing" which an abstraction (like TBB) offers, because you can assume that TBB will evolve to support new hardware with little or no effort for applications using it. Handwritten code, by contrast, will need to be rewritten as the hardware changes in ways originally not anticipated.
How would you debug software that is using TBB?
James Reinders: You can reasonably expect to get your job done with current tools because TBB leads you to programs more likely to 'just work' than less abstract ways of implementing parallelism. However, I would strongly recommend some debugging tools and tuning tools to help.
Does Intel provide any special tool?
James Reinders: Yes, I recommend the Intel Thread Checker (to directly pinpoint potential deadlock and potential race conditions) and Intel VTune Performance Analyzer (with the Intel Thread Profiler included with VTune) for performance tuning. TBB also has an option that enables a few extra hooks for the checking tool, which makes the task even better.
Since Intel is also a major CPU creator, is there any secret that TBB is using to provide better performance?
James Reinders: I assume you mean "secrets connected to using the hardware".
The thing closest to the hardware is the very careful implementation of locks and atomic operations. Getting those right means knowing the best way to do them for a particular piece of hardware. If that interests you, then look at the implementation in the source for x86, x86-64, Itanium, or G5 that are already coded up there.
But outside that very low level aspect, TBB's power is higher level constructs which don't lean on a precise hardware coding method. The coolest things are probably: (a) task stealing, (b) coding to be cache invariant, (c) scalable memory allocation. If these are of interest, I speak to them some in my book on TBB along with references to further reading... and the source code is there to look at too.
I was reading the FAQ and here I found that "Sun and Intel are working together to support the Solaris platform using Sun Studio software. This contribution is expected during the latter half of 2007." Any news?
James Reinders: The current build works for Solaris on x86. Sun is looking at building with their compilers (instead of gcc) and supporting SPARC as well. The effort is on track to have binaries posted this year, hopefully sooner rather than later. It's a bit of a 'time available' effort, community fashion, so I don't have a fixed date. I know there aren't any issues other than finding time to finish it up and post it.
Intel CPUs share a big L2 cache among the cores, while AMD ones have a smaller L2 cache for each core. Does TBB handle this too? How?
James Reinders: TBB works to use cache invariant algorithms, which means algorithms which automatically fit into the cache available by design (not by computing size and adjusting). Our math libraries are designed the same way.
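As an illustration of what fitting the cache "by design" can look like, the sketch below uses the classic recursive matrix transpose, a textbook example of this family of algorithms (often called cache-oblivious). It is not taken from TBB or Intel's math libraries. The function names and the base-case size of 8 are assumptions; the cutoff only limits recursion overhead and is not derived from any cache size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes the sub-block [r0, r1) x [c0, c1) of the row-major
// rows x cols matrix `in` into the row-major cols x rows matrix `out`.
// The block is halved along its longer side until it is small, so at some
// recursion depth the working set fits in whatever cache is present,
// without the code ever asking how large that cache is.
void transpose_block(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t rows, std::size_t cols,
                     std::size_t r0, std::size_t r1,
                     std::size_t c0, std::size_t c1) {
    const std::size_t h = r1 - r0;
    const std::size_t w = c1 - c0;
    if (h <= 8 && w <= 8) {  // base case: plain loops over a tiny block
        for (std::size_t r = r0; r < r1; ++r)
            for (std::size_t c = c0; c < c1; ++c)
                out[c * rows + r] = in[r * cols + c];
        return;
    }
    if (h >= w) {  // split the longer dimension in half
        const std::size_t mid = r0 + h / 2;
        transpose_block(in, out, rows, cols, r0, mid, c0, c1);
        transpose_block(in, out, rows, cols, mid, r1, c0, c1);
    } else {
        const std::size_t mid = c0 + w / 2;
        transpose_block(in, out, rows, cols, r0, r1, c0, mid);
        transpose_block(in, out, rows, cols, r0, r1, mid, c1);
    }
}

// Convenience wrapper for a whole matrix.
void transpose(const std::vector<double>& in, std::vector<double>& out,
               std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols && out.size() == rows * cols);
    transpose_block(in, out, rows, cols, 0, rows, 0, cols);
}
```

The same source runs unchanged on a part with one large shared L2 and on a part with small per-core caches. The two recursive calls in each branch touch disjoint blocks, so they could also be run as independent tasks.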
The other thing is the load balancing in TBB, which handles the difference in performance different tasks will have and juggles work to keep all processors busy. This is really the bigger one, I think, for TBB because cache variations make static partitioning of a problem unlikely to be optimal everywhere, so dynamic adjustments are very important.
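The difference between a fixed split and dynamic adjustment can be sketched in standard C++. In the example below, workers claim the next chunk of indices from a shared atomic counter whenever they become free, so a thread that lands on cheap iterations simply takes more chunks. This is a deliberately simpler technique than the task stealing TBB uses, and `dynamic_for_sketch`, its parameters, and the chunk size are all hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not a TBB API: runs body(i) for every i in [0, n).
// Work is not divided up front; each worker repeatedly claims the next
// `chunk` indices from a shared counter until none remain.
template <typename Body>
void dynamic_for_sketch(std::size_t n, std::size_t chunk, unsigned workers,
                        Body body) {
    assert(chunk > 0 && workers > 0);
    std::atomic<std::size_t> next{0};

    auto worker = [&next, &body, n, chunk] {
        for (;;) {
            // fetch_add returns the previous value, so every chunk start
            // is handed to exactly one worker.
            const std::size_t begin = next.fetch_add(chunk);
            if (begin >= n) return;
            const std::size_t end = std::min(n, begin + chunk);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(worker);
    for (std::thread& t : pool) t.join();
}
```

Smaller chunks balance uneven work better but touch the shared counter more often. That trade-off is one source of the overhead a dynamic scheme carries compared with a static schedule.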
What type of resources does TBB require? It cuts down the time spent to write the software, but how much does it weigh on performance and memory usage?
James Reinders: Minimal impact on footprint (the libraries are not very large), and the only 'overhead' on performance would come from the 'dynamic' manner of coding parallelism so TBB can do load balancing. Most generally, this is a performance win which can't be eliminated. However, if a static schedule will work for you and you have a lot of processing to do (very coarse grain), the overhead of TBB can be noticeable, but I think we're still talking 10-20 percent at most. Such programs might be better written in OpenMP using the static scheduling directive. I've seen some people port an OpenMP application where they used static scheduling with OpenMP, and then complain about slowdowns. This is not the right test for TBB, since TBB can code many, many problems which OpenMP cannot. We made TBB co-exist with OpenMP and hand coded threads, so you can use each if that is your preference even in a single application.