It's easy to imagine the reaction of seasoned C++ programmers upon seeing yet another string class implementation: "So why do we need another string class? We had too many in the pre-standard C++ Library world, and now that we finally have std::string, why not just use it?"
I agree, but only if the standard implementation doesn't introduce significant performance problems. In this article, I show how the standard string implementation isn't always good enough, particularly for multithreaded applications that create lots of temporary strings. That said, it is worth noting that there's no such thing as a single standard string implementation: they all differ. (For examples of the different implementations, see Item 15 in Scott Meyers's Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library, Addison-Wesley, 2001.) Here, I analyze one of the more popular implementations: the one that comes with g++ 2.96 on Red Hat Linux 7.3. To this end, I create a number of multithreaded tests showing that this implementation can have real performance problems for multithreaded applications that need to create lots of unrelated temporary string objects. To address these performance problems, I present a string class that gives you the speed and efficiency of automatic C strings, plus the convenience of C++ string objects that are STL friendly.
std::string Class Design
All string implementations, including the one in g++ 2.96, usually share the same design principles:
- Memory for the string value is always allocated on the heap. (Scott Meyers does describe one implementation that avoids allocating memory for very short strings.) Think about it: you have a string object on the stack, but there is always a heap allocation for the value of the string, unless the object is initialized through a copy constructor and the implementation is reference counted (which generates other problems).
- String values are reference counted; so if you assign string s1 to s2, their internal pointers point to the same location in memory. This is done to make copy constructors and assignment operators very fast: they don't need to allocate memory and copy the value, just set the pointer and increment a reference counter. The same logic makes the destructor very fast; in most cases, it just decrements the reference counter.
- Closely related to reference counting is the copy-on-write (COW) optimization, which allocates and copies only when one of the string objects pointing to the same value changes.
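To make the second and third items concrete, here is a minimal sketch of a reference-counted, copy-on-write string. It is an illustration of the design principles only, not the actual g++ 2.96 layout; the class name and the setChar accessor are invented for the example. Copying merely shares the buffer and bumps a counter, and a write detaches the value first.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Minimal reference-counted, copy-on-write string sketch.
// Illustrative only; not the real libstdc++ representation.
class RcString {
    struct Rep { int refs; char *data; };
    Rep *rep;
public:
    explicit RcString ( const char *s ) {
        rep = new Rep;
        rep->refs = 1;
        rep->data = strdup ( s );        // the only allocation on construction
    }
    RcString ( const RcString &other ) : rep ( other.rep ) {
        ++rep->refs;                     // copy = share pointer + count; no allocation
    }
    ~RcString () {
        if ( --rep->refs == 0 ) {        // last owner frees the value
            free ( rep->data );
            delete rep;
        }
    }
    void setChar ( int i, char c ) {     // write path: copy-on-write
        if ( rep->refs > 1 ) {           // value is shared, so detach first
            Rep *fresh = new Rep;
            fresh->refs = 1;
            fresh->data = strdup ( rep->data );
            --rep->refs;
            rep = fresh;
        }
        rep->data[i] = c;
    }
    const char *c_str () const { return rep->data; }
    int refCount () const { return rep->refs; }
private:
    RcString &operator= ( const RcString & ); // omitted for brevity
};
```

Note that the increments and decrements of refs above are exactly the operations that need serialization once two threads can touch the same Rep.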
The first item is usually a bad idea when you have lots of temporary strings as automatic variables in a multithreaded application. The heap is shared between threads, so some locking is inevitable; otherwise, the heap's internal data would be corrupted. This approach defeats the idea of fast, nonblocking creation/access of stack variables in multithreaded applications (every thread has its own nonshared stack). It looks deceptive to programmers, too. For instance, in Example 1, mystring is an automatic (stack) variable with no indication that memory for "Some C value" is allocated on the heap; unfortunately, that is exactly what happens.
int myfunc()
{
    std::string mystring ( "Some C value" );
    ...
}
Example 1: This defeats the notion of fast, nonblocking creation/access of stack variables in multithreaded applications.
Likewise, the second and third items may sound good until you realize they're a bad fit for multithreaded applications. The value and the reference counter are shared between multiple string variables, and potentially between multiple threads, which absolutely requires access serialization (locking) around the reference counter. Atomic increment/decrement/check can be used instead if these operations are supported by the CPU's instruction set; see "Optimizations That Aren't (In a Multithreaded World)," by Herb Sutter, C/C++ Users Journal, June 1999.
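The atomic alternative Sutter discusses can be sketched with C++11's std::atomic, a facility that postdates the g++ 2.96 era examined here but captures the idea: the counter is updated with single atomic instructions instead of a mutex. The struct and method names below are invented for the example.

```cpp
#include <atomic>
#include <cassert>

// Sketch of a lock-free reference counter, assuming the CPU
// supports atomic read-modify-write (std::atomic postdates
// the g++ 2.96 era discussed in the article).
struct AtomicRefCount {
    std::atomic<int> refs;
    AtomicRefCount () : refs ( 1 ) {}
    void acquire () {
        refs.fetch_add ( 1, std::memory_order_relaxed ); // no lock needed
    }
    // Returns true when the caller was the last owner and must free.
    bool release () {
        return refs.fetch_sub ( 1, std::memory_order_acq_rel ) == 1;
    }
};
```

Even lock free, the counter is still a shared cache line that every copying thread must write, so contention costs remain; the atomic version removes the mutex, not the sharing.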
Listing One (teststring1.cpp) creates 100,000 temporary std::string objects on the stack (most of the memory is actually allocated on the heap). It has only one thread running, but it was built as a multithreaded application, so it does all the required locking (refer to the Makefile, available at the top of this article). I ran this and subsequent tests on a 2-GHz Dell desktop with 1 GB of RAM under Red Hat Linux 7.3. Example 2(a) shows the results.
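The listings time their work with a utilGetTickCount helper from util.h, which is not reproduced in the article. A plausible implementation returning elapsed milliseconds, assuming POSIX gettimeofday, might look like this:

```cpp
#include <sys/time.h>
#include <cassert>

// Assumed implementation of the timing helper used by the listings;
// the real util.h is not shown in the article.  Returns a millisecond
// tick count suitable for computing elapsed intervals.
unsigned int utilGetTickCount ()
{
    struct timeval tv;
    gettimeofday ( &tv, 0 );
    return (unsigned int)( tv.tv_sec * 1000 + tv.tv_usec / 1000 );
}
```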
Listing One
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <pthread.h>
#include "util.h"

unsigned int createStrings ( int n )
{
    static char *samples[] = { "something1","something2","something3","something4" };
    printf ( "Thread <%d>: Creating/destroying %d strings...\n", pthread_self(), n );
    unsigned int start = utilGetTickCount ();
    for ( int i=0; i<n; i++ )
        std::string str ( samples[i%4] );
    unsigned int end = utilGetTickCount ();
    printf ( "Thread <%d>: total time=%d millisecs, average=%f\n",
             pthread_self(), end-start, (float)(end-start)/n );
    return ( end - start );
}

int main ( int argc, char **argv )
{
    if ( argc > 2 )
    {
        printf ( "teststring1 [<number of strings>]\n" );
        return 0;
    }
    int n = 100000;
    if ( argc > 1 )
        n = atoi ( argv[1] );
    createStrings ( n );
    exit(0);
}
(a)
Thread <1024>: Creating/destroying 100000 strings...
Thread <1024>: total time=108 millisecs, average=0.001080

(b)
Thread <1026>: Creating/destroying 100000 strings...
Thread <2051>: Creating/destroying 100000 strings...
Thread <3076>: Creating/destroying 100000 strings...
Thread <4101>: Creating/destroying 100000 strings...
Thread <5126>: Creating/destroying 100000 strings...
Thread <6151>: Creating/destroying 100000 strings...
Thread <7176>: Creating/destroying 100000 strings...
Thread <8201>: Creating/destroying 100000 strings...
Thread <9226>: Creating/destroying 100000 strings...
Thread <10251>: Creating/destroying 100000 strings...
Thread <1026>: total time=7804 millisecs, average=0.078040
Thread <2051>: total time=8749 millisecs, average=0.087490
Thread <10251>: total time=10455 millisecs, average=0.104550
Thread <6151>: total time=10643 millisecs, average=0.106430
Thread <8201>: total time=10718 millisecs, average=0.107180
Thread <9226>: total time=10724 millisecs, average=0.107240
Thread <7176>: total time=10755 millisecs, average=0.107550
Thread <5126>: total time=10760 millisecs, average=0.107600
Thread <4101>: total time=10765 millisecs, average=0.107650
Thread <3076>: total time=10867 millisecs, average=0.108670
Example 2: (a) The results of running teststring1.cpp; (b) results of running teststring2.cpp.
Listing Two (teststring2.cpp) is the same test, but with 10 threads running, each creating/destroying 100,000 automatic string objects. (Again, every one of them allocates memory on the heap and involves some locking and copying.) Example 2(b) presents the less-than-encouraging results of the teststring2.cpp test. With just 10 threads, the average time to handle (create/destroy) an automatic string object increased approximately 100 times (roughly 10,000 versus 108 milliseconds total per thread). Since these results seem unacceptable for high-load multithreaded applications, I decided to find out where the time is being spent, then create a new string class that doesn't have those deficiencies. Possible performance culprits include the locking introduced by the reference-counting optimization, heap allocations (and their locking), and copying of the value.
Listing Two
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <pthread.h>
#include "util.h"

void *createStrings ( void *n_void )
{
    int n = (int) n_void;
    static char *samples[] = { "something1","something2","something3","something4" };
    printf ( "Thread <%d>: Creating/destroying %d strings...\n", pthread_self(), n );
    unsigned int start = utilGetTickCount ();
    for ( int i=0; i<n; i++ )
        std::string str ( samples[i%4] );
    unsigned int end = utilGetTickCount ();
    printf ( "Thread <%d>: total time=%d millisecs, average=%f\n",
             pthread_self(), end-start, (float)(end-start)/n );
    return NULL;
}

int main ( int argc, char **argv )
{
    if ( argc > 3 )
    {
        printf ( "teststring2 [<number of strings>] [<number of threads>]\n" );
        return 0;
    }
    int n = 100000;
    int n_thr = 10;
    if ( argc > 1 )
        n = atoi ( argv[1] );
    if ( argc > 2 )
        n_thr = atoi ( argv[2] );
    pthread_attr_t attr;
    pthread_attr_init ( &attr );
    pthread_t *tid = new pthread_t[n_thr];
    int i;
    // create threads
    for ( i=0; i<n_thr; i++ )
    {
        if ( pthread_create ( &tid[i], &attr, createStrings, (void *)n ) )
        {
            printf ( "Failed to create thread <%d>\n", i );
            exit(-1);
        }
    }
    // wait for all threads to terminate
    for ( i=0; i<n_thr; i++ )
        pthread_join ( tid[i], NULL );
    delete [] tid;
    exit(0);
}
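As a preview of the direction the rest of the article takes, a string that keeps its value in a fixed-size buffer inside the object avoids both the heap and the shared reference counter entirely: an automatic instance lives wholly on the thread's own stack. Here is a minimal sketch under that assumption; the class name, capacity parameter, and truncation policy are illustrative choices, and the class presented later in the article is more complete and STL friendly.

```cpp
#include <cassert>
#include <cstring>

// Sketch of a fixed-capacity string whose value lives inside the
// object itself, so an automatic instance never touches the heap
// and needs no locking.  CAP and the truncation policy are
// illustrative, not the article's actual class.
template <int CAP>
class StackString {
    char buf[CAP + 1];
    int len;
public:
    StackString ( const char *s ) {
        len = (int) strlen ( s );
        if ( len > CAP )
            len = CAP;            // truncate rather than allocate
        memcpy ( buf, s, len );
        buf[len] = '\0';
    }
    const char *c_str () const { return buf; }
    int length () const { return len; }
};
```

The trade-off is a hard capacity limit (or a heap fallback for long values), which is exactly the kind of design decision the rest of the article works through.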