Ian is currently working as a development consultant at Symantec on Internet security tools. He can be contacted at ian_barile@yahoo.com.
The Session Initiation Protocol (SIP) is a signaling protocol for Internet conferencing, telephony, presence, event notification, and instant messaging. I recently had to write a TCP-to-UDP proxy server for SIP that could handle 100,000 concurrent connections. The proxy server is a gateway that lets instant-messaging clients tunnel out of corporate networks using TCP. While evaluating the requirements for the proxy server, I considered two approaches, thread pooling and I/O multiplexing, for developing a scalable socket server.
Although thread pooling is simpler to develop and well documented, I/O multiplexing is a superior technology. However, few I/O multiplexing implementations have been developed due to its complexity and lack of documentation. Thread pooling limits the number of clients that can be simultaneously serviced to the number of threads in the thread pool, but I/O multiplexing handles multiple clients per worker thread. In addition, I/O multiplexing reduces the CPU time used for context switching for worker threads and the time each worker thread is I/O bound. The use of I/O multiplexing lets socket servers scale upwards of 100,000 concurrent connections.
In this article, I describe a method to implement an abstraction layer that provides a single interface for I/O multiplexing on both UNIX and Windows.
Thread Pooling versus I/O Multiplexing
Many servers are written using the thread-pooling model because of its simplicity. With thread pooling, one file descriptor is added to a worker thread for the lifetime of the connection. Using a single connection per thread lets the data buffer be local on the thread stack, simplifying buffer and state management. This reduces the development time needed to bring servers to market. Unfortunately, there are three drawbacks to thread pools:
- The number of threads an OS can create per process.
- The time it takes to switch between worker threads (context switching).
- Worker threads are I/O bound.
Limited by the 4 gigabytes of virtual address space per process, Windows can create approximately 1000-2000 threads, depending on the size of the thread stack. Different versions of UNIX have different limits on the number of threads that can be created. Another thread-pooling limitation is the CPU time required for context switching among large numbers of worker threads. Context switching is a "heavy" operation: The CPU time a server spends context switching reduces the CPU cycles available for the server to process I/O. Depending on the OS and hardware, the thread-pooling model reaches the point of diminishing returns around 500 concurrent worker threads. When a single thread can only process I/O from one file descriptor, it must wait until that file descriptor has completed its transaction before processing another connection. If the clients are on low-bandwidth connections, worker threads are tied up waiting to process I/O. These drawbacks limit the number of concurrent connections a socket server can handle using the thread-pooling model.
I/O multiplexing, on the other hand, enables an application to overlap its I/O processing with the completion of I/O operations. Applications manage overlapped I/O by processing socket handles (client connections) through events that are sent from the kernel to the application. These notify the application that I/O has completed. By using an event-based mechanism, each worker thread can process I/O from multiple clients while the underlying driver waits for I/O to complete.
An application's ability to process I/O from many clients per thread is preferable to having one client per worker thread. With one client per thread, a context switch must occur each time the application needs to process I/O from another client. Handling multiple clients per worker thread enables a server application to support a significantly larger number of clients, processing I/O for each client as soon as it is made ready by the OS. Each client is still I/O bound, but the threads are free to process whatever I/O is available. The number of worker threads used to process the I/O should also be considerably smaller than in the thread-pool model. A simple formula for the number of worker threads is worker threads = 2*n, where n is the number of CPUs in the server running the application.
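To make the 2*n rule concrete, here is a minimal sketch (not part of the proxy server's code; WorkerThreadCount() is a hypothetical helper) that derives the worker-thread count from the CPU count reported by the OS:

// Hypothetical helper: compute the worker-thread count as 2*n, where n is
// the number of CPUs reported by the operating system.
#ifdef _WIN32
#include <windows.h>
static unsigned int WorkerThreadCount()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);                        // dwNumberOfProcessors holds n
    return 2 * si.dwNumberOfProcessors;
}
#else
#include <unistd.h>
static unsigned int WorkerThreadCount()
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);    // number of online CPUs
    return 2 * static_cast<unsigned int>(n > 0 ? n : 1);
}
#endif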
Operating systems differ in their native support for I/O multiplexing and the effectiveness of each implementation:
- UNIX-based operating systems share similar support for I/O multiplexing through the use of signals, select(), and poll() APIs, and a new device /dev/poll.
- Windows supports asynchronous I/O through select(), various Windows APIs, and I/O completion ports.
- Java has native I/O multiplexing in the 1.4.1 SDK through the selector API. Unfortunately, the selector API is limited to processing 64 clients per instance of the selector class.
The most efficient mechanisms for I/O multiplexing are /dev/poll for UNIX and I/O completion ports on Windows.
Implementing I/O Multiplexing
When implementing I/O multiplexing, good design principles should be respected so that the library can be reused across many types of applications. All socket API logic and I/O processing should be handled in layers separate from the I/O multiplexing implementation. Circular buffers should be used for the input buffers because the amount of data present on each read is unknown; circular buffers also simplify reconstructing data packets. Since the application is receiving completed I/O, it is more efficient to read a stream of data (several bytes) into memory and then parse the stream than to read the data byte by byte.
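As an illustration of the kind of circular buffer described above, here is a minimal sketch; the class name, capacity, and interface are assumptions rather than the article's actual library:

#include <cstddef>

// Hypothetical fixed-size circular buffer for per-connection input.
// Reads of unknown length are appended at the tail; the parser consumes
// complete packets from the head, so partial packets survive between reads.
class CircularBuffer {
public:
    CircularBuffer() : head_(0), tail_(0), size_(0) {}

    // Append up to len bytes; returns the number of bytes actually stored.
    std::size_t Write(const char *data, std::size_t len) {
        std::size_t stored = 0;
        while (stored < len && size_ < CAPACITY) {
            buf_[tail_] = data[stored++];
            tail_ = (tail_ + 1) % CAPACITY;
            ++size_;
        }
        return stored;
    }
    // Copy up to len bytes out of the buffer; returns the number copied.
    std::size_t Read(char *out, std::size_t len) {
        std::size_t copied = 0;
        while (copied < len && size_ > 0) {
            out[copied++] = buf_[head_];
            head_ = (head_ + 1) % CAPACITY;
            --size_;
        }
        return copied;
    }
    std::size_t Size() const { return size_; }

private:
    enum { CAPACITY = 8192 };     // assumed per-connection capacity
    char buf_[CAPACITY];
    std::size_t head_, tail_, size_;
};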
UNIX includes several facilities to develop socket server applications that use I/O multiplexing. These include signals, select(), poll(), and /dev/poll.
The use of signals for I/O multiplexing on UNIX-based systems can lead to complicated implementations, and signals haven't always been reliable. The select() and poll() APIs are UNIX system calls that let the OS notify the application when I/O is ready to process. Both have severe limitations when writing scalable servers: select() has a hard-coded limit of 1024 file descriptors and is slow, while poll() has no hard-coded limit but is considerably slower than select().
/dev/poll is a new device available on Solaris 7 and some versions of Linux, and it is the best choice for developing applications that use I/O multiplexing on UNIX. It can handle an unlimited number of file descriptors and is considerably faster than select() or poll(). A simple implementation of /dev/poll provides a mechanism for adding, receiving, and processing event notifications. After the application receives an event indicating that I/O is ready for processing, read() must be called on the file descriptor to move the data from the kernel buffer into the application's buffers.
To use /dev/poll, you must open a handle to /dev/poll through the open() API. To have a file descriptor watched by /dev/poll, add it to a pollfd structure and write that structure to /dev/poll using the write() API. To find out whether any I/O is ready to be read, call ioctl() and check the return value. Listing One demonstrates a procedural approach to receiving file-descriptor events from /dev/poll.
To increase the response time and scalability of the servers, a handle to a /dev/poll device can be created for each worker thread. This distributes incoming clients throughout the pool of /dev/poll handles, as in the sketch below. With this approach, servers can maintain and process I/O from more than 50,000 concurrent connections. Note: Solaris limits the number of file descriptors per process using hard and soft limits; these configuration options can be modified in /etc/system.
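The distribution itself can be as simple as round-robin assignment of each accepted client to the next worker's /dev/poll handle. The sketch below is an illustration only (devpollFds and AssignClient() are hypothetical names), mirroring the registration step in Listing One:

#include <poll.h>
#include <unistd.h>

// Hypothetical round-robin assignment of accepted clients to per-thread
// /dev/poll handles; devpollFds[i] is the /dev/poll handle opened by
// worker thread i (see Listing One for the registration details).
static int nextWorker = 0;

void AssignClient(int clientFd, int devpollFds[], int nWorkerThreads)
{
    struct pollfd pfd;
    pfd.fd = clientFd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    // Register the client with the next worker's /dev/poll handle.
    write(devpollFds[nextWorker], &pfd, sizeof(pfd));
    nextWorker = (nextWorker + 1) % nWorkerThreads;
}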
When including <sys/devpoll.h> in a C++ program, a compilation error occurs in <sys/poll_impl.h>:
In file included from /usr/include/sys/devpoll.h:11,
                 from unixioeventhandler.cpp:7:
/usr/include/sys/poll_impl.h:271: parse error before `}'
This can be corrected by moving Example 1(a) up one "#endif", to just before Example 1(b).
Windows supplies several APIs to develop socket server applications using I/O multiplexing. These include select(), asynchronous Winsock APIs, and I/O completion ports.
By default, the Windows select() API is set to handle only 64 file descriptors. This can be changed by defining FD_SETSIZE before including winsock2.h (rather than editing the header itself) and recompiling your code. The asynchronous routines in the Winsock2 APIs are difficult to use and don't offer a clean event system for sharing file descriptors across multiple threads. I/O completion ports are the superior I/O multiplexing implementation on Windows.
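For example, the limit can be raised at compile time (the value 1024 is arbitrary):

// winsock2.h only defines FD_SETSIZE if it hasn't been defined already,
// so defining it before the include raises the per-fd_set limit.
#define FD_SETSIZE 1024
#include <winsock2.h>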
I/O completion ports have some features that are unique to Windows. They let the system control context switching to reduce the number of context switches and allow the next available worker thread to process I/O from any client. However, the complexities of I/O completion ports, along with buffer management, complicate implementation.
To use I/O completion ports, you must first create a completion port with the CreateIoCompletionPort() API. Each additional file descriptor is added to the completion port by calling CreateIoCompletionPort() again. After adding a file descriptor to a completion port, WSARecv() must be called on the file descriptor so that the completion port can signal when the I/O has completed. Worker threads block on a call to GetQueuedCompletionStatus(), which returns when I/O is ready to be processed. Listing Two demonstrates a procedural approach to using I/O completion ports.
There are two major differences between /dev/poll and I/O completion ports. The first difference is that when using /dev/poll, I/O events are sent to the thread that contains the handle to the /dev/poll device for the file descriptor being signaled (instead of the next available thread). The second difference is in the way buffers are managed to receive input from clients. When using I/O completion ports, the data sent from the client is already in the input buffer when GetQueuedCompletionStatus() returns. When ioctl() returns on UNIX for the /dev/poll implementation, a read() on the file descriptor must be called to put data in the input buffer.
When testing scalable socket servers on Windows, you have to modify the MaxUserPort and TcpNumConnections registry values and reboot the system to be able to handle the maximum number of connections that Windows allows. The maximum value for both of these keys is 65,534 (0xfffe). Before modifying these keys, you can open approximately 3500 ephemeral ports; afterwards, Windows is able to open approximately 27,000 ephemeral ports. This has been tested on Windows 2000 and XP. However, the values in Table 1 may need to be created.
Data Management
When using I/O multiplexing, a scheme for receiving/managing data received from a client is critical. In the thread-pool model, all the data received from the client is kept locally on the worker thread stack that handles the client connection. In I/O multiplexing, multiple file descriptors are processed on a single thread. To preserve the data that is being read from the client, an association must be built between the client and the buffers that hold client data. This association occurs with the concept of a session, which monitors and tracks the lifetime of the client connection.
Session.h and session.cpp (available electronically; see "Resource Center," page 5) implement a session class that only allows itself to be allocated on the heap. When the client connection is terminated, the session is cleaned up. To uniquely identify each session, the integer value of the file descriptor for the client connection is used. Each session is created on the heap so that it cannot be destroyed simply by going out of scope. Each session is tracked in a Singleton class (implemented with the double-checked lock pattern) called a "session manager." Sessions are used to store data buffers and to ensure that each file descriptor has a unique buffer for its data. Sessionmanager.h and sessionmanager.cpp (available electronically) demonstrate a Singleton object that provides a container for sessions.
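Because session.h and sessionmanager.cpp are only available electronically, the following is a rough sketch of the idea rather than the article's code; the names and members are assumptions. A heap-only session is keyed by its file descriptor and owned by a Singleton manager (shown here with a simplified static-local Singleton instead of the double-checked lock):

#include <map>
#include <vector>

// Hypothetical session keyed by the client's file descriptor.
class Session {
public:
    static Session *Create(int fd) { return new Session(fd); }   // heap-only
    int Fd() const { return fd_; }
    std::vector<char> &Buffer() { return buffer_; }   // per-client input buffer
                                                      // (a circular buffer in practice)
private:
    explicit Session(int fd) : fd_(fd) {}   // private ctor/dtor force heap allocation
    ~Session() {}
    friend class SessionManager;            // only the manager may delete a session

    int fd_;
    std::vector<char> buffer_;
};

// Hypothetical Singleton that tracks every live session.
class SessionManager {
public:
    static SessionManager &Instance() {
        static SessionManager instance;     // simplified; the article's version
        return instance;                    // uses a double-checked lock
    }
    void Add(Session *s) { sessions_[s->Fd()] = s; }
    Session *Find(int fd) {
        std::map<int, Session *>::iterator it = sessions_.find(fd);
        return it == sessions_.end() ? 0 : it->second;
    }
    void Remove(int fd) {
        std::map<int, Session *>::iterator it = sessions_.find(fd);
        if (it != sessions_.end()) { delete it->second; sessions_.erase(it); }
    }
private:
    SessionManager() {}
    std::map<int, Session *> sessions_;
};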
State Management
Managing state is more complex when working with I/O multiplexing. In thread pooling, the client connection state is usually stored in the local variables of the worker thread. This is not an option for storing state when using I/O multiplexing. State, such as where your client connection is in transmission of protocol information, must be stored with the session because any local variables would be overwritten by the next client connection being processed. You can either overload your sessions with state information for all the protocols that your server handles, or add protocol objects to your session objects (which can be added and removed throughout the lifetime of the connection). Allowing protocol objects to store state provides a layer of abstraction between a session's life cycle and the protocol being used to process the data in the client connection.
Additional session state management can be done through reference counting to make sure that a session isn't destroyed while it is being used by another thread. A smart pointer class is an excellent mechanism for managing reference counts and assisting with the session's lifecycle. A smart pointer can ensure that the reference count on a session is decremented when the smart pointer is removed from the stack.
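A minimal sketch of such a smart pointer follows; it assumes the referenced type exposes hypothetical AddRef()/Release() methods that adjust a thread-safe counter and destroy the object when the count reaches zero (modern code might simply reach for a shared-pointer class):

// Hypothetical reference-counting handle; T is expected to expose
// AddRef()/Release() methods that adjust a thread-safe counter and
// destroy the object when the count reaches zero.
template <class T>
class RefPtr {
public:
    explicit RefPtr(T *p = 0) : p_(p) { if (p_) p_->AddRef(); }
    RefPtr(const RefPtr &other) : p_(other.p_) { if (p_) p_->AddRef(); }
    ~RefPtr() { if (p_) p_->Release(); }          // decrement when leaving the stack

    RefPtr &operator=(const RefPtr &other) {
        if (other.p_) other.p_->AddRef();         // safe for self-assignment
        if (p_) p_->Release();
        p_ = other.p_;
        return *this;
    }
    T *operator->() const { return p_; }

private:
    T *p_;
};

// Usage sketch: a RefPtr<Session> holds the session while a worker thread
// processes its I/O; the count is decremented automatically on return.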
Cross-Platform I/O Multiplexing Abstraction
Because today's server environments are typically heterogeneous, code needs to be cross-platform. The cross-platform abstraction is implemented in a class called IOEventHandler (available electronically), which consists of IOEventHandler.h, UnixIOEventHandler.cpp, and Win32IOEventHandler.cpp; the Windows implementation uses an I/O completion port class (also available electronically). UnixIOEventHandler.cpp and Win32IOEventHandler.cpp both include IOEventHandler.h and are only compiled into object files on their respective platforms. This mechanism provides platform implementations that are decided at compile time, reuses the same interface, and reduces the use of #defines. IOEventHandler has four public methods: Make(), Add(), Remove(), and WaitForEvents().
The Make() method is used to create a new instance of the IOEventHandler; this prevents IOEventHandler objects from being created on the stack. On Windows, a single IOEventHandler should be created and shared by all worker threads. On UNIX systems, where the native I/O multiplexing routines don't choose the next available thread on which to process the I/O, an IOEventHandler object can be created per worker thread. IOEventHandler objects can be managed through a Singleton class (double-checked lock pattern) that associates each IOEventHandler object with a worker thread.
The Add() and Remove() methods add and remove file descriptors from being watched for events by the IOEventHandler object. Remove() is a no-op on Windows because the I/O completion port removes a file descriptor automatically when its file handle is closed.
The WaitForEvents() method is called by each worker thread to block while waiting for I/O events to be ready for processing. WaitForEvents() calls into two private methods: handleEvents() and processData(). handleEvents() checks to make sure that no error occurred on the file descriptor and then calls processData(). processData() is responsible for parsing data from the input stream.
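The concrete sources are available only electronically, but the public surface described here might look roughly like the header sketch below; the method signatures are assumptions:

// Hypothetical IOEventHandler.h: a single class declaration with two
// implementation files (UnixIOEventHandler.cpp wrapping /dev/poll,
// Win32IOEventHandler.cpp wrapping an I/O completion port) chosen at
// compile time.
class IOEventHandler {
public:
    static IOEventHandler *Make();   // factory; prevents stack allocation
    ~IOEventHandler();

    int Add(int fd);                 // begin watching a file descriptor
    int Remove(int fd);              // stop watching (no-op on Windows)
    int WaitForEvents();             // block until I/O is ready for processing

private:
    IOEventHandler();                // created only through Make()

    int handleEvents(int fd);        // verify no error occurred, then call processData()
    int processData(int fd);         // parse data from the input stream

    void *impl_;                     // platform handle: /dev/poll fd or completion port
};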
Iocompletionport.h and iocompletionport.cpp (available electronically) are classes that provide a wrapper around I/O completion ports. Unixioeventhandler.cpp and win32ioeventhandler.cpp show how /dev/poll and I/O completion ports are wrapped into a single interface.
Conclusion
Today's servers must handle ever-increasing loads. Using I/O multiplexing lets servers utilize system resources in an efficient manner and facilitates scalable server applications.
DDJ
Listing One
// Get a handle to the /dev/poll device
if ((fd = open("/dev/poll", O_RDWR)) < 0) {
    return EPERM;
}
// Create a pollfd array large enough to hold all the file
// descriptors that will be watched
pollfd = (struct pollfd *)malloc(sizeof(struct pollfd) * MAXBUF);
if (pollfd == NULL) {
    close(fd);
    return ENOMEM;
}
// Set up /dev/poll to watch for I/O events on the file descriptors
for (i = 0; i < nFileDescriptorCount; i++) {
    pollfd[i].fd = fds[i];
    pollfd[i].events = POLLIN;
    pollfd[i].revents = 0;
}
// Register only the entries that were filled in above
if (write(fd, &pollfd[0], sizeof(struct pollfd) * nFileDescriptorCount) !=
        sizeof(struct pollfd) * nFileDescriptorCount) {
    perror("failed to register file descriptors with /dev/poll");
    close(fd);
    free(pollfd);
    return EBADF;
}
// Get I/O events from the /dev/poll device
dopoll.dp_timeout = -1;
dopoll.dp_nfds = MAXBUF;
dopoll.dp_fds = pollfd;
nDescriptorsSignalled = ioctl(fd, DP_POLL, &dopoll);
if (nDescriptorsSignalled < 0) {
    perror("/dev/poll ioctl DP_POLL failed");
    close(fd);
    free(pollfd);
    return EINVAL;
}
// Read the available data from each signaled file descriptor
for (i = 0; i < nDescriptorsSignalled; i++) {
    read(dopoll.dp_fds[i].fd, rbuf, STRLEN);
}
Listing Two
// Create the initial I/O completion port with the number of worker threads
hCompletionPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, nWorkerThreads);
if (hCompletionPort == NULL) {
    return GetLastError();
}
// Add a file handle for the I/O completion port to watch
hCompletionPort1 = CreateIoCompletionPort(hFile, hCompletionPort,
                                          dwCompletionKey, nWorkerThreads);
if (hCompletionPort1 == NULL) {
    return GetLastError();
}
// Perform a nonblocking read so GetQueuedCompletionStatus can notify
// us when the read has completed (WSA_IO_PENDING is the expected result)
nRet = WSARecv(hFile, lpBuffers, dwBuffCount, &dwBytesReceived,
               &dwFlags, &OverLapped, NULL);
if (nRet != 0 && WSAGetLastError() != WSA_IO_PENDING) {
    return WSAGetLastError();
}
// Block on GetQueuedCompletionStatus until the I/O operation completes.
// The data will be in the buffer supplied to WSARecv; lpOverLapped
// receives a pointer to the OVERLAPPED structure passed to WSARecv.
nRet = GetQueuedCompletionStatus(hCompletionPort, &dwBytesReceived,
                                 &dwCompletionKey, &lpOverLapped, TIME_OUT);
if (nRet == 0 && GetLastError() != WAIT_TIMEOUT) {
    return GetLastError();
}