The I/O portion of the Standard C++ library was designed to be extensible: it exposes what, in other libraries, would be internal components. You don't need to know anything about the library's architecture to write std::cout << "Hello world", but, for more complicated tasks, you can extend or modify almost every aspect of the standard I/O library. We've already seen several ways of extending the library's behavior. Locale facets provide another way.
Using Locales
A locale in C++ is an object of class std::locale. Every stream object has its own associated locale. You can get a copy of a stream's locale by writing std::cout.getloc(), and, if L is a locale object, you can tell a stream to use L by writing std::cout.imbue(L). Note that telling std::cout to use L has no effect on other streams' locales.
A different version of locales was originally introduced into ANSI C to support internationalization and localization, and locales continue to be used for those purposes in the Standard C++ library. The simplest way to use locales is to use the ones that the system already knows about. Simplest of all is the "classic" locale, returned by the static member function std::locale::classic. The "classic" locale is a basic locale that's suitable for simple English text; it corresponds to the way that the pre-ANSI C library behaved, and to ANSI C's "C" locale. By default, all of the standard stream objects use the "classic" locale.
You can also construct a locale object given a name. On IRIX, for example, a German locale is named "de". Here's a simple test program that shows how to construct a locale by name:
    #include <locale>
    #include <iostream>

    int main()
    {
      std::locale L("de");
      std::cout.imbue(L);
      std::cout << 3.14 << std::endl;
    }
If you're on a system where "de" is a German locale [1], then, when you compile and run the above program, the output should be "3,14". The C++ I/O library delegates most of its formatting decisions to locale facets, so, when you tell cout to use a German locale, cout will use German conventions for formatting numbers.
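Because locale names aren't portable, constructing a locale by name can fail: the constructor throws std::runtime_error when the system doesn't recognize the name. One way to cope, sketched below, is to fall back to the "classic" locale. The helper name locale_or_classic is an assumption of this example, not part of the library.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: construct a locale by name, falling back to the
// classic locale when the system doesn't recognize the name.
std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        // The named constructor throws std::runtime_error for names
        // that the implementation doesn't know.
        return std::locale::classic();
    }
}
```

With this helper, std::cout.imbue(locale_or_classic("de")) gives German formatting where "de" exists and leaves the output unchanged elsewhere.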
This is all very similar to the C library's setlocale function. To use preexisting locales at the level of setlocale, there are just two more things to know, both of which involve constructors. First, many operating systems allow users to specify a "preferred" locale. If you create a locale whose name is the empty string, as in std::locale L(""), it's supposed to be that preferred locale. (Exactly what it means to specify a preferred locale, of course, depends on your implementation.) Second, std::locale has a default constructor. You get to decide what the default locale is: if you write std::locale::global(L), then, from that point on, any default-constructed locale will be a copy of L.
It's often useful to put these two features together. If you put the following code fragment at the beginning of your program, it will create a locale that represents the user's preferences, set it to be the default, and tell the predefined stream objects to use that locale. (You can't omit that last step, because setting a new default doesn't affect whatever locales already exist.)
    std::locale L("");
    std::locale::global(L);
    std::cin.imbue(L);
    std::cout.imbue(L);
    std::cerr.imbue(L);
    std::clog.imbue(L);
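The following sketch makes the effect of std::locale::global observable without depending on any system locale name. It builds an unnamed locale from the "classic" one plus a freshly allocated std::numpunct<char> (using a locale constructor that takes a facet pointer), installs it as the global locale, and records what happens to streams created before and after. The struct global_effect and the function observe_global_change are hypothetical names used only here.

```cpp
#include <locale>
#include <sstream>

// Hypothetical record of what we observe around a call to global().
struct global_effect {
    bool default_is_new_global;   // is std::locale() the new global?
    bool old_stream_unchanged;    // stream created earlier left alone?
    bool new_stream_uses_global;  // stream created later picks it up?
};

global_effect observe_global_change() {
    std::ostringstream before;  // created under the old global locale

    // An unnamed locale that is distinct from the classic one.
    std::locale custom(std::locale::classic(), new std::numpunct<char>);

    // global() returns the locale that was global until now.
    std::locale previous = std::locale::global(custom);
    std::ostringstream after;   // created under the new global locale

    global_effect result;
    result.default_is_new_global = (std::locale() == custom);
    result.old_stream_unchanged = (before.getloc() != custom);
    result.new_stream_uses_global = (after.getloc() == custom);

    std::locale::global(previous);  // put things back the way they were
    return result;
}
```

All three flags should come out true, which is why the fragment above has to imbue the predefined streams explicitly: they already existed when the global locale changed.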
Locales and Their Facets
What exactly is a locale object? You might reasonably assume that std::locale encapsulates data needed for localization and internationalization. That's not completely wrong, but it's not completely right either. True, std::locale does have a few localization-related member functions: if L is a locale, you can compare two strings by writing L(s1, s2), you can test characters with std::isupper(c, L), and so on. But functions like that are just frills, and std::locale would do just as well without them.
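For completeness, here is a small sketch of those convenience functions in use. Because a locale has a function call operator that compares two strings, a locale object can be passed directly to std::sort as the comparison function. The helper names sorted_with and count_upper are assumptions of this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: sort strings using the locale's collation rules.
// The locale itself serves as the comparison function object.
std::vector<std::string> sorted_with(std::vector<std::string> words,
                                     const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
    return words;
}

// Hypothetical helper: count the characters that the locale classifies
// as uppercase.
std::size_t count_upper(const std::string& text, const std::locale& loc) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i)
        if (std::isupper(text[i], loc))
            ++n;
    return n;
}
```

Both helpers do nothing more than forward to facets (std::collate<char> and std::ctype<char>, respectively), which is the sense in which they are frills.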
The surprising answer is that std::locale is a type-safe heterogeneous container class, nothing more. Locales have very little to do with localization; all of the real work is performed by the objects contained in locales. Those objects are called facets.
Let's unpack that phrase. Heterogeneous means that the elements contained in an std::locale are of different types. (Contrast that to a homogeneous container like std::vector<int>, where every element is of type int.) A type-safe container is one where type information isn't lost: when you look at an element in the container, it has the same type as it did when you put it in. (Contrast that to a framework where all of the elements appear to be of, say, type Object*, and where you have to downcast an element to do anything with it.)
There are a number of ways to write a type-safe heterogeneous container class. The method used in std::locale involves three further restrictions. First, once a locale has been constructed, there is no way to add new elements. Second, a locale can contain at most one element of a particular type. The type serves as a sort of index: you can write std::has_facet<T>(L) to ask a locale L if it has an element of type T, and, if it does, you can use std::use_facet<T>(L) to get a reference to that element. (If a locale has no element of type T, then std::use_facet<T>(L) will throw an exception.) Third, any type that you put in a locale object must inherit from the base class std::locale::facet and must contain a static member variable, called id, whose type is std::locale::id.
These restrictions make it clear that std::locale doesn't have to work magic. The system uses the static id to associate every facet class with a unique integer index, and, because every facet inherits from std::locale::facet, a locale can internally store its facets in something like a vector<std::locale::facet*>. All std::use_facet<T>(L) has to do is find T's index and look in the slot that corresponds to that index. If there's an object in that slot, then it must be safe for the system to cast that object to a const T& [2]. So long as you obey the restrictions in the last paragraph, however, you don't have to worry about the precise implementation details.
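To make the lookup rules concrete, here is a minimal sketch with a do-nothing facet. The class unit_facet and the helper lookup_throws are hypothetical; std::has_facet, std::use_facet, and the std::bad_cast exception are the standard pieces.

```cpp
#include <cstddef>
#include <locale>
#include <typeinfo>  // std::bad_cast

// Hypothetical facet used only for this illustration. It satisfies the
// two requirements: it inherits from std::locale::facet and it has a
// static member named id of type std::locale::id.
class unit_facet : public std::locale::facet {
public:
    static std::locale::id id;
    explicit unit_facet(std::size_t refs = 0) : std::locale::facet(refs) {}
};
std::locale::id unit_facet::id;

// Hypothetical helper: true if looking up Facet in loc throws
// std::bad_cast, which is what use_facet does for a missing facet.
template <class Facet>
bool lookup_throws(const std::locale& loc) {
    try {
        (void)std::use_facet<Facet>(loc);
        return false;
    } catch (const std::bad_cast&) {
        return true;
    }
}
```

The "classic" locale has no unit_facet, so has_facet reports false and lookup_throws reports true; a locale built as std::locale(std::locale::classic(), new unit_facet) has one, and the answers flip.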
This mechanism works well with inheritance and polymorphism. Internally, std::locale maintains pointers to facets. Externally, when you write std::use_facet<T>(L), you get a reference. Everything works just the same if you're dealing, not with T itself, but with one of T's subclasses. Accordingly, most facet classes use virtual member functions. You can think of such facets as interfaces, designed for extension by inheritance.
So what does all this machinery have to do with localization? Very little. It's all in the facets! The standard facets, which are contained in every locale object, include std::ctype<char> (which deals with character classification for char), std::ctype<wchar_t> (which does the same for wide characters), and std::numpunct<char> and std::numpunct<wchar_t> (which contain information for numeric formatting). These facets are used by the I/O portion of the Standard C++ library; other facets, like std::collate<char> (which provides string comparison and hashing), are supplied so you can use them in your own programs. In both cases, however, it's the facets contained in an std::locale object, not std::locale itself, that contain the information needed for localization.
The Standard also defines some classes that inherit from those facets and that are constructed by name. That's what std::locale("de") really means: you're getting a locale where the facet objects are std::ctype_byname<char>("de"), std::numpunct_byname<char>("de"), and so on. The fragment
    typedef std::numpunct<char> NP;
    const NP& np = std::use_facet<NP>(L);
    char point = np.decimal_point();
invokes one of numpunct's virtual member functions [3], and that member function is overridden in numpunct_byname.
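You can use the same hook yourself. The sketch below gets German-style decimal output without relying on a system locale name: it derives from std::numpunct<char>, overrides the protected virtual function do_decimal_point, and installs the result in a locale by passing a facet pointer to a locale constructor. The names comma_numpunct and format_with_comma are made up for this example.

```cpp
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: like the default numpunct, except that the
// decimal point is a comma.
class comma_numpunct : public std::numpunct<char> {
public:
    explicit comma_numpunct(std::size_t refs = 0)
        : std::numpunct<char>(refs) {}
protected:
    // Called by the public, nonvirtual decimal_point().
    virtual char do_decimal_point() const { return ','; }
};

// Hypothetical helper: format a double using a locale that is the
// classic locale with comma_numpunct substituted for numpunct<char>.
std::string format_with_comma(double value) {
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new comma_numpunct));
    out << value;
    return out.str();
}
```

Here format_with_comma(3.14) produces the same "3,14" as the earlier "de" program.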
Overriding a Facet
C++ locales were designed to be extensible in two ways: by deriving a new class from an existing facet, or by adding new kinds of facets. The Standard C++ library has already given us examples of the first method, the *_byname facets.
Suppose, for example, that you want to change the way that the library classifies characters. That's not so unlikely as it might seem, because the I/O library uses character classification for parsing. The I/O library normally skips whitespace when it performs formatted input, and, when you're reading such objects as strings, the library stops when it sees whitespace. By default the whitespace characters are space, tab, newline, and a few less common control characters. And, of course, that behavior isn't always appropriate: in some applications '\t' is the only character that you want to treat as a delimiter, and ' ' should be treated as an ordinary character; in other applications, you might want to treat a character like '_' the same way as ' '.
Character classification is handled by the std::ctype<char> facet. You can change the way that the I/O library classifies characters by defining a class that inherits from std::ctype<char> and then using a locale where your derived class is substituted for the ctype base class.
The first step is to write the derived class. In this particular case, all of the work is in the constructor, rather than in overridden virtual member functions, because the C++ Standard mandates that std::ctype<char> uses a lookup table for character classification. (This is a special optimization for char. It wouldn't work for wide characters, because in general the lookup table would be too large.) The constructor simply makes a copy of the "classic" table and modifies table entries as appropriate. Here, for example, is a ctype subclass where the space character is no longer whitespace.
    class my_ctype : public std::ctype<char>
    {
    public:
      my_ctype(std::size_t refs = 0);
    private:
      mask my_table[table_size];
    };

    my_ctype::my_ctype(std::size_t refs)
      : std::ctype<char>(my_table, false, refs)
    {
      std::copy(classic_table(), classic_table() + table_size, my_table);
      my_table[' '] = (mask) (print | punct);
    }
There are a few things to notice about my_ctype.
First, we're using names that we inherit from the base class std::ctype<char>: classic_table (a static member function that returns a pointer to the beginning of the "classic" locale's lookup table), table_size (the number of elements in the table), mask (an enum type), and two of mask's enumerators, print and punct. Each character's entry in the lookup table is a value of type mask.
We're also inheriting the static member variable id from the base class, instead of defining id ourselves. That's important. We want the system to put my_ctype in the same slot that it would normally use for std::ctype<char>, so the two must use the same id.
Second, we're setting the character classification table to my_table, instead of the default table, by passing my_table as an argument to the base class's constructor. The second base class constructor argument, false, is related: it tells the base class that it doesn't need to delete the table that we're passing to it in the first argument. If we had been using dynamically allocated memory for the table, then we could have written true instead, in which case the base class's destructor would have invoked delete[] on the table.
Third, we've given our constructor an argument, refs, that has no apparent purpose, and we're passing it to the base class. The refs argument is related to a fact about C++ locales that I haven't mentioned, but that you might have already guessed on your own: facets that are installed into locales are reference counted. If you think about how locales are implemented, you'll see that locales almost have to use reference counting. After all, std::locale has a copy constructor; what should std::locale L2(L1) do with L1's facets? The only reasonable answer, and the answer that the C++ Standard gives, is that L2 should use the same facet objects as L1.
The implementation keeps track of how many locales are using a particular facet object. Usually, once no locales are using a facet object anymore, the system destroys that facet. However, the library gives you a choice: you can tell it not to destroy a facet, even when that facet isn't being used anymore. The base class std::locale::facet has a constructor that takes an argument of type size_t. If that argument is zero (the default), then reference counting works the usual way. If the argument is nonzero, then the system won't ever destroy the facet. All of the standard facets, including std::ctype<char>, have a constructor that takes a refs argument and passes that argument to the facet base class.
We didn't have to define a constructor that took a refs argument: we could have just decided on a definite memory allocation policy. But giving users the choice of memory allocation policy takes no extra effort, so there's no reason not to do it. (You might wonder why you'd ever want to have a facet object that doesn't get destroyed automatically. Again, it's because you don't have to use dynamic memory allocation: it's occasionally convenient to have a facet that's defined as a file-scope variable.)
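The effect of refs can be observed with a facet that counts its own destructions. This is only a sketch: counted_facet and destroyed_by_library are hypothetical names, and the destructor is made public here (unlike the standard facets) so that the example can clean up the facet that the library leaves alone.

```cpp
#include <cstddef>
#include <locale>

// Hypothetical facet that counts how many times its destructor has run.
class counted_facet : public std::locale::facet {
public:
    static std::locale::id id;
    static int destroyed;
    explicit counted_facet(std::size_t refs = 0)
        : std::locale::facet(refs) {}
    ~counted_facet() { ++destroyed; }
};
std::locale::id counted_facet::id;
int counted_facet::destroyed = 0;

// Installs a facet into a short-lived locale and reports how many
// counted_facet objects the library destroyed when that locale went away.
int destroyed_by_library(std::size_t refs) {
    const int before = counted_facet::destroyed;
    counted_facet* f = new counted_facet(refs);
    {
        std::locale loc(std::locale::classic(), f);
    }   // the only locale that referred to *f is gone now
    const int count = counted_facet::destroyed - before;
    if (refs != 0)
        delete f;  // with nonzero refs, the facet's lifetime is our job
    return count;
}
```

With refs equal to zero the library destroys the facet along with the last locale that uses it, so the function returns 1; with refs equal to one it returns 0, and the caller has to dispose of the facet.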
Once we've defined my_ctype, the next step is to install it into a locale. There's no member function that will let us add a facet to an existing locale, but there's a constructor that lets us do the next best thing: create a copy of an existing locale that uses a new facet. Here we're creating a locale L that's just like the "classic" locale except that it uses my_ctype instead of the default std::ctype<char>:
std::locale L(std::locale::classic(), new my_ctype);
(Since facets are reference-counted, we can create a my_ctype with new and not worry about deleting that object ourselves.)
Finally, now that we have a locale that uses our new facet, we can install that locale into a stream object. Here's a small test program that shows what happens:
    #include <iostream>
    #include <string>
    #include <locale>
    #include <algorithm>
    #include <sstream>

    class my_ctype : public std::ctype<char>
    {
    public:
      my_ctype(std::size_t refs = 0);
    private:
      mask my_table[table_size];
    };

    my_ctype::my_ctype(std::size_t refs)
      : std::ctype<char>(my_table, false, refs)
    {
      std::copy(classic_table(), classic_table() + table_size, my_table);
      my_table[' '] = (mask) (print | punct);
    }

    int main()
    {
      std::istringstream in("one two\tthree four");
      std::locale L(std::locale::classic(), new my_ctype);
      in.imbue(L);

      std::string s;
      while (in >> s)
        std::cout << s << std::endl;
    }
When you compile and run this program, the output will be
    one two
    three four
The tab character is still treated as a delimiter, but, because of the new ctype facet, the space characters are not.
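The opposite adjustment, treating a character like '_' as whitespace, follows the same recipe. The sketch below is a variation on my_ctype; the names underscore_ctype and split_words are assumptions of this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical ctype variant: the classic table, except that '_' is
// classified as whitespace.
class underscore_ctype : public std::ctype<char> {
public:
    explicit underscore_ctype(std::size_t refs = 0)
        : std::ctype<char>(my_table, false, refs) {
        std::copy(classic_table(), classic_table() + table_size, my_table);
        my_table[static_cast<unsigned char>('_')] = space;
    }
private:
    mask my_table[table_size];
};

// Hypothetical helper: split text into words with formatted input,
// using a locale in which '_' also counts as a delimiter.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    in.imbue(std::locale(std::locale::classic(), new underscore_ctype));
    std::vector<std::string> words;
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}
```

Given "alpha_beta gamma", split_words returns the three words alpha, beta, and gamma, because both '_' and ' ' now end a word.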
Conclusion
A facet can be as simple, or as complicated, as you need it to be. There are only two requirements: a facet must inherit from the base class std::locale::facet, and it must have a public static member variable id, either directly or by inheritance, whose type is std::locale::id. So, for example, the following class is a complete facet. (We don't need to declare a virtual destructor, since we're inheriting one from std::locale::facet.)
    class msg : public std::locale::facet
    {
    private:
      std::string str;
    public:
      static std::locale::id id;
      msg(const std::string& s, std::size_t refs = 0)
        : std::locale::facet(refs), str(s) { }
      std::string get() const { return str; }
    };

    std::locale::id msg::id;
This simple msg facet might even sometimes be useful: since every stream object has a locale, msg allows you to associate a unique string tag with every stream. You could write a manipulator to access that tag:
    std::ostream& tag(std::ostream& os)
    {
      std::string s = std::has_facet<msg>(os.getloc())
                    ? std::use_facet<msg>(os.getloc()).get()
                    : std::string("*DEFAULT*");
      os << s;
      return os;
    }
There are no real limits on how sophisticated a facet can be. You might reasonably define a facet to represent time zones, to represent the different ways of displaying personal names, or to format the ubiquitous Employee class that shows up in so many C++ examples. The output operator itself, operator<<, could become a wrapper function that finds the appropriate facet and then delegates all of the real work to that facet. Here's what the I/O delegation pattern looks like:
    class Employee_fmt : public std::locale::facet
    {
    protected:
      ...  // Formatting information
    public:
      static std::locale::id id;
      Employee_fmt(std::size_t refs = 0);
      virtual std::istream& get(std::istream&, Employee&) const;
      virtual std::ostream& put(std::ostream&, const Employee&) const;
    };

    std::istream& operator>>(std::istream& is, Employee& e)
    {
      std::locale L = is.getloc();
      // Facets can't be copied, so branch rather than mixing a reference
      // and a temporary in a single ?: expression.
      if (std::has_facet<Employee_fmt>(L))
        return std::use_facet<Employee_fmt>(L).get(is, e);
      return Employee_fmt().get(is, e);
    }

    std::ostream& operator<<(std::ostream& os, const Employee& e)
    {
      std::locale L = os.getloc();
      if (std::has_facet<Employee_fmt>(L))
        return std::use_facet<Employee_fmt>(L).put(os, e);
      return Employee_fmt().put(os, e);
    }
This is an example of what Andrew Koenig calls the "fundamental theorem of software engineering": Every problem can be solved with an extra level of indirection. The Employee_fmt facet can use some reasonable formatting defaults, but users who need fine control over formatting can define their own derived classes that inherit from Employee_fmt just as my_ctype inherits from std::ctype<char>. The insertion and extraction operations will transparently use the derived class.
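Here is one way the output half of that pattern might be filled in, as a sketch under stated assumptions. The Employee struct, the two output formats, the derived class csv_employee_fmt, and the helper to_text are all invented for this example; only the shape of the delegation comes from the discussion above.

```cpp
#include <cstddef>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical Employee type with just enough data to format.
struct Employee {
    std::string name;
    int number;
};

// Facet with a default format; put() is virtual so users can replace it.
class Employee_fmt : public std::locale::facet {
public:
    static std::locale::id id;
    explicit Employee_fmt(std::size_t refs = 0)
        : std::locale::facet(refs) {}
    virtual std::ostream& put(std::ostream& os, const Employee& e) const {
        return os << e.name << " (#" << e.number << ")";
    }
};
std::locale::id Employee_fmt::id;

// A user's replacement. It inherits id, so it occupies the same slot.
class csv_employee_fmt : public Employee_fmt {
public:
    explicit csv_employee_fmt(std::size_t refs = 0) : Employee_fmt(refs) {}
    virtual std::ostream& put(std::ostream& os, const Employee& e) const {
        return os << e.name << ',' << e.number;
    }
};

// The output operator only finds a facet and delegates to it.
std::ostream& operator<<(std::ostream& os, const Employee& e) {
    const std::locale loc = os.getloc();
    if (std::has_facet<Employee_fmt>(loc))
        return std::use_facet<Employee_fmt>(loc).put(os, e);
    return Employee_fmt().put(os, e);  // no facet installed: use defaults
}

// Hypothetical helper: format an Employee with or without the CSV facet.
std::string to_text(const Employee& e, bool csv) {
    std::ostringstream out;
    if (csv)
        out.imbue(std::locale(out.getloc(), new csv_employee_fmt));
    out << e;
    return out.str();
}
```

Note that operator<< never mentions csv_employee_fmt: it asks for an Employee_fmt, and the virtual call reaches whichever derived class the stream's locale holds.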
Numeric formatting in the Standard C++ library uses this same I/O delegation pattern: for numeric I/O, operator<< and operator>> are wrappers that pass their arguments to the std::num_put and std::num_get facets. (The exact details are slightly different for standard numeric I/O than for our Employee example, but only slightly: num_get and num_put operate on streambuf iterators instead of on streams. This mainly affects the details of error handling.)
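To see the streambuf-iterator interface directly, the sketch below does by hand roughly what operator<< does for a long: it looks up the std::num_put<char> facet in the stream's locale and calls put. The helper name format_via_num_put is an assumption of this example, and the error handling is simplified compared with what a real library does.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: format a long by calling num_put directly.
std::string format_via_num_put(long value) {
    std::ostringstream out;

    // num_put<char> writes through ostreambuf_iterator<char> by default.
    typedef std::num_put<char> NumPut;
    const NumPut& np = std::use_facet<NumPut>(out.getloc());

    // put() takes the output iterator, the stream (for flags and
    // locale), the fill character, and the value to format.
    std::ostreambuf_iterator<char> end =
        np.put(std::ostreambuf_iterator<char>(out), out, out.fill(), value);

    // A failed iterator means that the stream buffer rejected a write.
    if (end.failed())
        out.setstate(std::ios_base::badbit);
    return out.str();
}
```

The facet never sees the stream's operator<<; it works purely in terms of the stream buffer, the formatting flags, and the locale.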
At this point, you might be wondering how to reconcile the use of facets with my advice from an earlier column [4]. When should you define a facet, and when should you instead use iword/pword and xalloc to define a format flag?
If you're writing an input or output operator whose behavior can be parameterized with a single boolean flag or a single integer argument, then you should use iword; that's what it's for. If there are multiple choices, or if there's a choice that's too complicated to describe in a straightforward manner as an integer or a boolean, or if you don't think you can foresee all of the possible choices that users might want in the future, then you should use a facet. Naive users can ignore facets, and sophisticated users can subclass and replace facets as needed.
My advice for when to use pword: never. You can use pword to associate a stream with a string or some other arbitrary object, but I can't think of any circumstances in which it would be a good idea. If you're dealing with such a complicated problem that you'd be tempted to use pword, you should define a facet instead. Our msg facet is far simpler than the equivalent pword-based solution, and far easier to extend.
Locales, and their facets, are fundamental to I/O in the Standard C++ library. The library uses facets to decide which characters are whitespace, to format numbers, and to convert characters between their internal and external encodings. It provides facets that format money and time and date. The Standard documents exactly how and when the facets are used; by design, you can change the behavior of the library by replacing a facet. Locales are an extremely general mechanism, however, and the standard facets are just examples. Whenever you have customizable data, and whenever you deal with formatting in a way that you think a user might need to change, you should consider expressing it as a facet.
Notes and References
[1] Alas, locale names aren't standardized. If you're using Borland C++ on Windows, for example, you'll need to say "German" instead of "de". You should check your system's documentation for available locale names.
[2] For more information about implementing std::locale, see Nathan Myers, "The Standard C++ Locale," Dr. Dobb's Journal, August 1998.
[3] That virtual member function isn't actually decimal_point itself, but a protected member function called do_decimal_point. All of the standard facets use that style: they have protected virtual member functions, which are called by public nonvirtual member functions.
[4] Matt Austern. "The Standard Librarian: User-Defined Format Flags," C/C++ Users Journal C++ Experts Forum, February 2001 http://www.cuj.com/experts/1902/austern.htm.
Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committee's library working group. He works at AT&T Labs Research and can be contacted at [email protected].