Matthew Wilson is a software-development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004) and Extended STL (Addison-Wesley, 2006). He can be contacted at http://imperfectcplusplus.com/.
[This month's column is, in part, an extract from Matthew's forthcoming book on STL extension, called Extended STL, which will be published by Addison-Wesley in 2006.]
This month I'm continuing the theme of how other languages may influence the idioms of C++. Where the last installment examined how Ruby's flexible subscripting might be emulated in C++, in this installment and the next I'm going to be considering how D's [1] notion of a slice can be applied in C++, in particular to strings. I'll be discussing, briefly, the "view" concept and its refined conceptthe slice view. I'll then introduce the STLSoft basic_string_view class template, and discuss its implementation and contracts. Finally, I'll discuss the contexts in which the use of string views is appropriate, along with those where it's not.
The View Concept
The concept of a view is best known from the field of relational databases, where a view is an alternative way of looking at data from one or more tables, with the following characteristics:
- A view is not a copy of the data from the source tables; indeed creation of a view does not involve creation of any data whatsoever.
- Changes to the data in the underlying table are reflected in the data apparent in the view.
- Where mutating access is allowed, changes to view data are concurrently reflected in the underlying table.
The three primary purposes of such views are:
- To present the information from two or more tables as if from a single table: Amalgamation.
- To restrict the available information; that is, to hide the contents of a particular table's column: filtering.
- To effect changes in the view into the underlying table, and vice versa: (logical) coherency.
There has been much interest in adapting the view concept to C++, including the work of Jon Seymour [2], the View Template Library (VTL) by Gary Powell and Martin Weiser [3], and the work of Maciej Sobczak [4]. My own work on views inhabits a lower level; its primary purpose is the raison d'etre of C++: efficiency. The additional purposes of views in C++ are legion, including:
- To present the collection and/or the data as a different type: Type Adaptation.
- To modify the values in the data as presented: Transformation.
- To present the data in a different order: Reordering.
- To avoid unnecessary copies while presenting a different perspective onto the data: (physical) coherency.
Views can have a bewildering variety of forms and functions, including aggregating, filtering, linearizing, polymorphic, sorting, reversing, restructuring, and transforming, to name but a few. For these two installments, I'm going to focus on only the slice view.
Slices in D (and in C/C++)
In D, a slice provides a view onto a section of an array. Consider the following code:
import std.stdio; void main() { char[] s = "This is a null-terminated string".dup; char[] sl = s [0 .. 4]; writef("[%s]\n", sl); // Prints "[This]" s[2 .. 4] = "at"; writef("[%s]\n", sl); // Prints "[That]" }
A slice is declared as type[], and is internally represented as a length plus a pointer. Hence, the creation of the slice sl does not involve a copy of the source string s, it merely "remembers" the pointer to the start of the literal string (&s[0]) and the length 4. When the contents of s are changed in the assignment of the string "at", the changes are reflected in any subsequent use of the slice sl. (Note that s is a duplicate of the literal string, which, as in C++, is read-only. An attempt to write to a [slice of a] literal will result in an access violation.) The C++ equivalent of a D slice is:
template <typename T> struct slice { size_t len; T *ptr; };
(A slice could also be represented as a pair of pointers. Although I've dallied with such a representationin the recls librariesI consider it to be an inferior representation to a length+pointer. This is because one tends to want to know the size more often than the "end" position. The adopted way also affords direct compatibility with D, and makes writing extensions for Ruby and other languages a little bit easier.)
Obviously, the previous template isn't exactly bristling with functionality, nor does it facilitate nonmutating (that is, const) access; something that the D language avoids by not supporting const as we know it.
Slice Views
To make such a type useful for C++ would require adding methods, such as constructors, assignment operators, and accessors; and providing support for iteration. Such a type exists in the STLSoft libraries, in the form of the array_view class template, which presents a (nonmutating) std::vector-like interface. There is also a more complex class template, basic_string_view, which provides more functionality to support a (nonmutating) std::basic_string-like interface, and which is the subject of this month's and next month's columns.
Slice views can be very simple. They consist of constructors, begin() and end(), rbegin() and rend(), a subscript operator, size(), and, importantly, base(), which returns the slice view's base pointer.
Slice View Advantages
The two main advantages of slice views are logical coherency and physical coherency. Logical coherency means that views always reflect the up-to-date contents of the underlying collection.
The latter advantage, physical coherency, translates into greater efficiency. It has two aspects: First, there is a reduction in memory usage because there are fewer, if any, copies made. Second, because slices don't do deep copies, the cost of copying slice instances is very low. We'll see demonstrations of these efficiency advantages in several different contexts in the second installment.
Further to these main advantages, the view-class templates, by virtue of their emulation of standard interfaces, also provide collection type adaptation. In other words, using basic_string_view allows slices of character arrays to be used in place of std::basic_string.
The slice view (also known as array view) is the only view concept that guarantees (and relies on) contiguous storage. This is analogous to the situation with std::vector, which is the only standard container that guarantees contiguous storage. As a consequence, slice views (and vectors) are compatible with C APIs.
Views Are Not Value Types
The fact that slice views are always up to date with respect to their underlying storage is, as we will see in the next installment, a great boon, but it also carries risks, and imposes limitations on their use. Simply put, a slice view, unlike the standard library containers (including basic_string), is not a value type. In Imperfect C++ [5] I define a value type as including these aspects:
- "Instances can be created as, or later made to be, copies of another instance."
- "Change[s] to the logical state of one instance does not result in a change to the logical state of [any of its copies]."
Slice views fulfill the first of these criteria, but not the second, because if one view is made as a copy of another, and the underlying storage changes, then the logical states (that is, the contents) of both views are affected: They are not independent.
Thus, one must be careful with views to not use them when the link to their underlying storage is, or could be, broken. For example, if one takes a string view onto the contents of a mutable container, such as a std::string, the view can be invalidated by mutating members in just the same way that an iterator can be:
std::string s("Short string"); stlsoft::string_view sv(s.data(), s.size()); std::string::iterator it = s.begin(); s = "Longer string"; // sv and it both invalidated
Obviously, holding a view onto a string after that string is deleted is pretty fatal. Therefore, string views are best used in contexts where their view nature is irrelevant (temporaries/single expressions) or in contexts where their nature is overt (split functions). We'll investigate these contexts in the next installment. Not being a value type also has implications for the string-view semantics, so let's now look at the definition of the class.
STLSoft's basic_string_view
The STLSoft basic_string_view class template provides an interface that corresponds to the standard basic_string to a significant degree, as shown in Listing 1. The notable features of basic_string_view are:
- It has a nonmutating (const) interface, except for copy assignment, clear(), and swap().
- It supports a rich set of constructors, à la basic_string, except for the range template constructors (because it must always be pointing to contiguous storage).
- It supports the rich set of comparison operators offered by basic_string.
- It provides all the usual basic_string element access, size, and iteration methods.
- It does not support the myriad search methods of basic_string.
As well as the length (m_length) and pointer (m_base) member variables that constitute the slice, there is a third member, m_cstr. This is for the implementation of the c_str() method, which will be discussed shortly. Given the simplicity of the slice view concept, the bulk of the implementation of the class is very straightforward, and I won't belabor you with the details: The constructors, copy assignment operator, swap(), size(), max_size(), length(), compare(), front(), back(), begin(), end(), rbegin(), rend(), and copy() methods all have exactly the same semantics as they do in basic_string. The seven constructors are all eminently straightforward: The default constructor sets the slice members m_length to 0 and m_base to NULL; the other constructors initialize the members as appropriate to their arguments. The m_cstr member is set to NULL in all cases. The destructor need do nothing with the slice members, but releases any memory buffer pointed to by m_cstr.
Because slice views, by definition, present information that they do not themselves manage, they sit somewhat uncomfortably on the border of the value-type domain. Hence, the other methods have interesting, and sometimes surprising, semantics: The implementations of capacity(), clear(), refresh(), equal(), c_str(), base(), data(), is_valid(), and empty_string_() will be examined in the remainder of the article, including an examination of the contract programming aspects of the class.
capacity()
capacity() returns the same values as size() and length(). This means that code that might test capacity() against length() will (always) receive a meaningful result of 0 of the spare capacity in the string instance.
clear() and refresh()
The clear() and refresh() methods are closely related. Several standard containers provide the clear() method, which means "empty out all contained elements." For the string view, this means setting the pointer to NULL and the length to 0. But it also means deallocating any memory used. The refresh() method leaves the logical state of the string view untouched, and performs only the latter action: resetting the pointer. We'll see why this is so shortly.
equal()
equal() is provided as an optimization because it can immediately return true if the internal pointers and lengths are identical, or immediately return false if the lengths differ.
base()
base() is a method from the slice view conceptthat returns the pointer part of the slice. Unless modified by clear(), swap(), or operator =(), whatever was passed to the constructor will be returned by this method.
c_str()
Because the string view may look onto a slice that does not end in a null terminator, it's obvious that c_str() cannot simply return m_base. Further, even if, at the time of construction, the element of the source string at m_base + m_length is a null terminatorwhich we can't test for anyway, because it might be in an uncommitted page of memory!the underlying storage can change at any time, without notification of the string view(s) mapping to it.
So, if we are to provide c_str(), we must allocate a buffer of length 1 + m_length (into m_cstr), and copy in the contents and add a null terminator. Naturally, once we've done this, it's possible that the underlying storage will change and render the allocated copy out of date. There are six options to handle this:
- Don't provide c_str(). Several of my reviewers commented firmly that this is the best (indeed the only sensible) option. But, while many of the uses to which the string views are most suited do not need c_str(), this is a nonstarter: There's simply too much code requiring C-style (that is, null terminated) strings for it to be practicable. I know; I've tried.
- Create the copy during construction, which persists for the lifetime of the string-view instance. Doing this would have drastic performance implications: It would likely be slower than using basic_string, or another full value-type string type. It would also mean that the copy, once stale, remains in that state.
- Create the copy the first time c_str() is called, which persists for the lifetime of the string view instance. This pays the cost only when necessary, but still risks stale data.
- (Re)create the copy each time c_str() is called. This means that the contents of the buffer are fresh, but it would have unacceptable semantics. Consider the code in Listing 2. This is a strong, but hidden, break from the semantics of basic_string. Too strong. Clause 21.3.6;2 of the Standard stipulates that "The program shall not...treat the returned value as a valid pointer after any subsequent call to a nonconst member function," and a string view's c_str() should offer similar guarantees.
- Create the copy the first time c_str() is called, and then update the contents of the (same) buffer with each subsequent call. This is especially suitable because string views have a fixed sizeexcept when reset with operator =(), clear(), and swap(). However, it would lead to situations in which Listing 3 might not hold to be true.
- Allocate the copy the first time c_str() is called and provide a methodrefresh()to reset the m_cstr member so that the next time c_str() is called the copy is made afresh. This avoids hidden semantic gotchas, without enforcing staleness of data. It's not perfect, but I believe it is the best mix of performance, usability, and sensible semantics, and is the approach I've used in basic_ string_view. Listing 4 shows the implementation of the c_str(), operator =(), refresh(), and empty_string_() methods. empty_string_() is a private static method that is used to provide an empty string for the given character type without needing a separate implementation file. c_str() returns either m_cstr (if previously allocated) or empty_string_() (if the instance is empty), or allocates a copy into m_cstr and returns it.
Unlike std::basic_string, there are no assign() methods. Only two methods exist for modifying a view after its construction: copy assignment via the copy-assignment operator, and swapping via swap(). It would be easy to implement a nearly full-complement of efficient assignment functions in terms of the constructors, but I deliberately did not do this; as it stands, the interface limits the ease with which the user can forget that string views are not value types.
base() and data() And the String/View Tension
Although it might not seem obvious, data() is one of the most important methods of the string view. The Standard prescribes, for basic_string, that data() return a nonNULL pointer to the first character in an array representing the contents of the string. Note that the pointer returned by data() does not have to be null terminated; this is, unlike c_str(), which must return a nonNULL pointer to a null terminated array.
The obvious implementation for data() would be simply to return m_base, but there are two reasons why we can't do this. First, it's common for functions that take length and pointer to ignore the pointer if the length is 0. Similarly, the behavior of the string-view constructor is well defined if the pointer is NULL. One appealing implementation option would be to initialize m_base to be the empty_string_() in the case where the size specified to a constructor is 0. However, slice views need to be able to provide, via base() and size(), the same description of the slice passed to the constructor, supporting cases (a) and (b) in Figure 1. By assigning m_base to empty_string_() when size is 0case (c) in Figure 1the required slice semantics would not be supported. Thus, the implementation of basic_string_view::data() is as shown in Listing 5.
This represents a slight performance cost for each access, but cannot be avoided, given the semantic constraints already outlined. If you don't care about NULL or are confident you won't receive it for a given string view instance, call base() instead.
is_valid()
Due to their somewhat schizophrenic nature, the specification of contracts for view classes must be subject to very careful consideration, to account for the fact that their logical contents are subject to external change. Interestingly, with a string view, this factor leads to very simple contracts. The invariant for the classrealized, as is my wont [5] in the is_valid() method (Listing 6)stipulates two obvious relationships: that the m_base pointer must be valid if length is non-0; and that the m_cstr member (used to implement c_str()) must be NULL if the length is 0.
What may be surprising, however, is that there is no association between the length and the contents of the underlying array. There are two reasons. First, just as with std::basic_string, we wish to allow strings with embedded NULL ('\0') characters. Second, because it's a view, it's eminently possible (and sometimes expressly desired) that the contents of the underlying array will change. It would be both dissuasively inefficient and logically flawed to police the view contents; hence, the invariant is so simple.
Example Uses of String Views
As mentioned earlier, a slice view instance is not valid outside the lifetime of its underlying array. String views are no exception, but because they are small, and look (and for the most part smell) like strings that are value types, one must be careful when using them to avoid the dead reference problem. Despite this, there are several excellent uses for them.
One such use is with the Open-RJ library, which uses stlsoft::basic_string_view<char> as its string type. Because an Open-RJ database is a single, immutable block of memory, within which reside the names and values of all fields in the database records, it would be wasteful indeed to take copies of these strings (for example, in the form of std::string) when slices will safely suffice. All such slices will be valid as long as the database is loaded, irrespective of the lifetime of the record and/or field objects from which they are elicited.
Another scenario in which the dead-reference problem is moot, and in which string views are a flat-out winner, is when manipulating collections of strings with algorithms, as in the definition of has_token_string() (whose performance will be featured in next month's column):
bool has_token_string(char const *str, char delim, char const *token) { typedef stlsoft::string_view string_t; typedef stlsoft::string_tokeniser<string_t, char> tokeniser_t; tokeniser_t tokens(str, delim); return tokens.end() != std::find( tokens.begin(), tokens.end(), token); }
When string_t is std::string (or stlsoft::simple_string), there is one visit to the heap for the copy of str, and one for each tokenized element in str; for example, if str is "cpp;c;java;;pl;;" (and delim is the semicolon) there would be five allocations. However, when string_t is stlsoft::string_view (a typedef for stlsoft::basic_string_view<char>), there are 0 allocations, regardless of the contents of str! Tests show that this function works between 2 and 10 times faster than with either std::string or stlsoft::simple_string.
Another example, which will be performance tested next time, is in combination with memory mapping, to present a text file as a string object:
platformstl::memory_mapped_file mmf("mmf.cpp"); stlsoft::string_view contents (static_cast<char*>(mmf.contents()), mmf.size()); . . . code that uses whole file as string 'contents'
Coming in Part II
Next time, I'll look at some other safe, but occasionally exotic, uses of string views, including in the implementation of Basic-like slice functions. I'll also discuss a number of performance tests that demonstrate the significant gains to be had by the appropriate use of string views.
Acknowledgments
Thanks to Bjorn Karlsson, Christopher Diggins, Garth Lancaster, Nevin Liber, Pablo Aguilar, and Walter Bright for their insights into the article.
References
- The D Programming Language, Digital Mars; http://www.digitalmars.com/d/.
- Viewsa C++ Standard Template Library extension, Jon Seymour, 1995-6; http://www.zeta.org.au/~jon/STL/views/doc/ views.html.
- View Template Library, Gary Powell and Martin Weiser, 2000. http://www.zib.de/weiser/vtl/.
- Sobzcak, Maciej. "STL Sequences & the View Concept," C/C++ Users Journal, April 2004.
- Wilson, Matthew. Imperfect C++, Addison-Wesley, 2005.
CUJ