Karen is a lecturer in the Intelligent Systems and Control Group at Queen's University, Belfast; and Stuart is with the School of Electronics, Electrical Engineering, and Computer Science. They can be contacted at [email protected] and [email protected], respectively.
As the bandwidth of Internet connections continues to increase, the idea of distributed virtual reality is itself becoming a reality. While we're already seeing it to some degree in multiplayer computer games, distributed virtual reality goes further, making it possible for participants to believe they are gathered together in the same place at the same time. But to really believe what your eyes and ears are telling you, you need to be able to touch things and get a sense of space and depth, something that can only come by seeing stereoscopically.
Until recently, stereoscopic display technology has been either extravagantly expensive or rather crude. There's no longer any reason for this. The availability of inexpensive webcams that interface to standard PC equipment via USB means that purchasing two color video input devices costs less than $50. Likewise, the cost of stereoscopic display devices is also plummeting. Shutter glasses such as those from eDimensional (http://www.edimensional.com/), for instance, deliver the stereoscopic effect by doubling the frame rate and inserting special sync signals in the video raster that are not normally seen on screen. In this article, we examine real-time VR by showing you how to use DirectX and OpenGL on a commodity PC to build a stereoscopic webcam.
Concept and Hardware
Figures 1 and 2 present the block architecture of the client/server software that powers our stereo TV.
The server application uses the DirectShow API to acquire synchronous video frames from two image sources and write them into a RAM buffer as a single frame. It can optionally preview the stereoscopic images from the cameras and deliver the images from the buffer in response to requests from clients.
The client application uses TCP/IP to request a left/right (L/R) image pair from the server. To display the output, it uses the same stereoscopic display capabilities of the OpenGL 3D graphics library that the server uses in its preview function. To focus on the main code details, we've excluded all error-processing statements and most variable declaration constructs from Listings One through Eleven (available electronically; see "Resource Center," page 5).
To test the programs, we used two USB 2.0-compatible webcams (Trust 360 Spacecam, http://www.trust.fr/products/product.aspx?artnr=13959) that deliver output at a resolution of 640×480 and at a rate of up to 30 frames per second (fps). Lower resolutions and frame rates are achieved when USB 1.0 cameras are used. The PC was a standard machine with 512 MB of RAM and an Intel Pentium 4 3.2-GHz processor. Interestingly, the processor's Hyper-Threading Technology had to be disabled because it interfered with the camera drivers when two cameras were connected. Stereoscopic display was delivered using a 100-Hz refresh rate CRT and a low-cost infrared shutter-glasses system from eDimensional. The glasses were connected via the VESA Standard Stereo Device Connector to an nVidia Quadro 4 750XGL display adapter. The same stereo display system was used on the client computer, which had an AMD Athlon 1.35-GHz processor.
Server Program Architecture
The server application program is a conventional Win32 project written in C++. The application is implemented using multiple threads; the overall design strategy and data flow are illustrated in Figure 1. There are four primary threads:
- The main application.
- A thread handling data acquisition from the left video camera.
- A thread handling data acquisition from the right video camera.
- A thread handling TCP/IP requests for video frames to be sent to client applications.
The left/right video camera threads also handle the option of previewing the video, either in a single stereoscopic-capable window or in individual windows for the left and right cameras. The thread that deals with the TCP/IP communications via the Windows sockets interface spawns additional threads to handle client connections. The image data flows from the camera threads into two screen buffers (implemented as a single block of RAM). The images in these buffers are read by the display thread and the communications thread as necessary (Figure 1).
The WinMain application entry point sets up a standard Win32 application by registering the necessary window classes, identifying a message-processing function to handle messages for the application's main window, and then creating a window of that class. It also initializes the COM object libraries and proceeds to spawn the three worker threads: the sockets thread and the two camera threads that acquire the left and right images. Once this is done, it goes into a Windows message loop until a WM_QUIT is detected. Finally, the main process thread terminates the worker threads, cleans up any memory buffers, and exits.
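To make the startup sequence concrete, here is a minimal sketch of how WinMain can spawn the worker threads. The names SocketThreadPr and CAMERA_ARGS are hypothetical stand-ins (the real declarations are in Listings One and Two), and error handling is omitted.

#include <windows.h>
#include <objbase.h>
#include <process.h>

struct CAMERA_ARGS { int id; };          // 0 = left camera, 1 = right camera
void GLWindowPr(void *arg);              // camera thread entry point (Listing Three)
void SocketThreadPr(void *arg);          // TCP/IP server thread (hypothetical name)

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int nCmdShow)
{
    CoInitialize(NULL);                  // DirectShow is COM based

    // ... register the window class and create the application's main window ...

    static CAMERA_ARGS left = {0}, right = {1};
    _beginthread(SocketThreadPr, 0, NULL);   // serve frames to clients
    _beginthread(GLWindowPr, 0, &left);      // acquire the left video stream
    _beginthread(GLWindowPr, 0, &right);     // acquire the right video stream

    MSG msg;
    while (GetMessage(&msg, NULL, 0, 0) > 0) {   // run until WM_QUIT arrives
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }

    // ... signal the worker threads to stop and free the video buffer here ...
    CoUninitialize();
    return (int)msg.wParam;
}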
All the working threads have read or write access (appropriately protected via critical sections in the code) to two global variables that define the horizontal and vertical resolutions of the video output from the cameras, and to a memory buffer that holds the video image frames. The video buffer stores 24-bit RGB pixels and is twice the size of a single camera frame. The first half of the video buffer is used to store the pictures from the left camera and the second half is used to store the pictures from the right camera. The camera threads have write access to these buffers; the rendering and sockets threads have read-only access. Because the resolution of the cameras won't change during program execution, the resolution is determined only once, at the start of the initialization of the first camera thread to run. It is assumed that the second camera is going to operate with the same dimensions. Listing One presents the most significant global variables and the C++ class of the object responsible for controlling the video acquisition from the cameras. Listing Two shows the most significant features of the WinMain function.
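The shared state boils down to a handful of globals. The following sketch, with illustrative variable names (the real declarations are in Listing One), shows the double-height buffer layout and the critical section that guards it.

#include <windows.h>

int              g_videoWidth  = 0;      // set once by the first camera thread to run
int              g_videoHeight = 0;
unsigned char   *g_videoBuffer = NULL;   // 24-bit RGB, two frames stacked top/bottom
CRITICAL_SECTION g_bufferLock;           // initialized in WinMain

void AllocateVideoBuffer(int width, int height)
{
    EnterCriticalSection(&g_bufferLock);
    g_videoWidth  = width;
    g_videoHeight = height;
    // width x height pixels per camera, 3 bytes per pixel, two cameras
    g_videoBuffer = new unsigned char[width * height * 3 * 2];
    LeaveCriticalSection(&g_bufferLock);
}

// The left camera writes at offset 0; the right camera writes at
// offset g_videoWidth * g_videoHeight * 3.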
Each camera thread uses the same code (entry point void GLWindowPr(void *arg); Listing Three). When the thread is spawned, the _beginthread(..) function is passed a pointer to a structure whose member variable indicates whether the thread is to look for and use the left or right camera, and hence into which half of the frame buffer it should copy the image data from its video source (left camera video data into the left buffer, and so on). The threads act independently, but use the DirectShow synchronization mechanism to get as little delay as possible between the left and right video streams. (In practice, this works well with USB 2.0 cameras but less so with USB 1.0 cameras attached to slower host computers.)
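In outline, each camera thread looks something like the following sketch. It is a simplification of Listing Three; the CAMERA_ARGS structure and global names are the illustrative ones used above.

#include <windows.h>
#include <objbase.h>
#include <process.h>

struct CAMERA_ARGS { int id; };                   // as in the WinMain sketch
extern int g_videoWidth, g_videoHeight;           // shared globals sketched earlier
extern unsigned char *g_videoBuffer;

void GLWindowPr(void *arg)
{
    CAMERA_ARGS *cam = (CAMERA_ARGS *)arg;
    CoInitializeEx(NULL, COINIT_MULTITHREADED);   // each thread initializes COM itself

    // ... create this camera's preview window, then build and run the
    //     DirectShow capture graph (Listing Five); building the first graph
    //     also establishes g_videoWidth and g_videoHeight ...

    // The renderer filter created by this thread writes into its own half of
    // the shared buffer: left camera at offset 0, right camera after it.
    long offset = (cam->id == 0) ? 0 : (long)g_videoWidth * g_videoHeight * 3;
    (void)offset;                                 // handed to the CRenderer instance

    MSG msg;                                      // pump messages until WM_QUIT
    while (GetMessage(&msg, NULL, 0, 0) > 0) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }

    // ... stop the graph and release the DirectShow interfaces here ...
    CoUninitialize();
    _endthread();
}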
To facilitate a preview of the acquired video, the camera threads create floating display windows. If the server's display hardware is stereo enabled, only one of these windows is made visible; otherwise, both are shown. The window message function initializes the OpenGL display system and configures it to draw into the rendering thread's window. OpenGL uses left and right double buffers to support stereoscopic display. The message-handling functions use a timer to render at regular intervals, and the OpenGL glDrawPixels(..) function does the job of putting the video data buffer on screen. Listing Four outlines the code for the rendering and window message handling.
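Getting the left and right back buffers comes down to asking for a quad-buffered (PFD_STEREO) pixel format when the rendering context is created. The sketch below shows one way to do this; it is a simplified stand-in for the initialization in Listing Four, and the fallback behavior and timer interval are assumptions.

#include <windows.h>
#include <GL/gl.h>

HGLRC InitStereoGL(HWND hwnd, HDC hdc)
{
    PIXELFORMATDESCRIPTOR pfd = { sizeof(pfd), 1 };
    pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL |
                     PFD_DOUBLEBUFFER   | PFD_STEREO;   // left/right back buffers
    pfd.iPixelType = PFD_TYPE_RGBA;
    pfd.cColorBits = 24;

    int format = ChoosePixelFormat(hdc, &pfd);
    SetPixelFormat(hdc, format, &pfd);

    HGLRC ctx = wglCreateContext(hdc);
    wglMakeCurrent(hdc, ctx);

    // If the driver did not grant stereo, show the two preview windows instead.
    DescribePixelFormat(hdc, format, sizeof(pfd), &pfd);
    if (!(pfd.dwFlags & PFD_STEREO)) { /* fall back to separate L/R windows */ }

    SetTimer(hwnd, 1, 40, NULL);   // render on WM_TIMER, roughly 25 times a second
    return ctx;
}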
Before there is any video to render, however, the GLWindowPr(..) function must build and run a DirectShow filter graph to capture the video. The key steps are in Listing Five. You need an instance of a custom rendering filter for the graph, and this is created in the line:
pRenderer = new CRenderer(this, id, NULL, &hr);
Listing Six defines the class for this object and its member variables and methods.
Function CaptureVideo(..) (Listing Seven) uses an ICaptureGraphBuilder2 object to link up the filters in the graph. The video sources are enumerated and one is selected for the left or right source (depending on whether this is the first or second camera-handler thread). Function FindCaptureDevice(..) (Listing Eight) shows how this is achieved using the DirectShow COM objects. Looking back at Listing Seven, the ICaptureGraphBuilder2 method RenderStream() links the video source with our CRenderer filter. It is the job of our custom filter to copy the DirectShow media samples into the video buffer (Listing Nine). Once built, the filter graph is run via the DirectShow media control interface. The video acquisition threads wait in a message-processing loop until terminated by the application's main thread.
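The sketch below shows the general shape of this step using the standard DirectShow interfaces (ICreateDevEnum, IGraphBuilder, ICaptureGraphBuilder2). It is a simplification of Listings Seven and Eight, with error handling and most Release() calls omitted, so the real signatures may differ slightly.

#include <dshow.h>

IBaseFilter *FindCaptureDevice(int index)     // 0 = first camera found, 1 = second
{
    ICreateDevEnum *devEnum = NULL;
    IEnumMoniker   *enumMon = NULL;
    IMoniker       *moniker = NULL;
    IBaseFilter    *source  = NULL;

    CoCreateInstance(CLSID_SystemDeviceEnum, NULL, CLSCTX_INPROC_SERVER,
                     IID_ICreateDevEnum, (void **)&devEnum);
    devEnum->CreateClassEnumerator(CLSID_VideoInputDeviceCategory, &enumMon, 0);

    for (int i = 0; enumMon->Next(1, &moniker, NULL) == S_OK; i++) {
        if (i == index) {                     // take the Nth video input device
            moniker->BindToObject(NULL, NULL, IID_IBaseFilter, (void **)&source);
            break;
        }
        moniker->Release();
    }
    return source;
}

void BuildGraph(IGraphBuilder *graph, ICaptureGraphBuilder2 *builder,
                IBaseFilter *source, IBaseFilter *renderer /* our CRenderer */)
{
    builder->SetFiltergraph(graph);
    graph->AddFilter(source,   L"Video Source");
    graph->AddFilter(renderer, L"Stereo Renderer");
    // Connect the camera's capture pin to our custom renderer filter.
    builder->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video,
                          source, NULL, renderer);
}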
The server application's socket thread binds to a high TCP port and listens for a connection from a client. The socket thread is terminated by a signal from the main thread. When a client connects, a new handler thread is started to serve out the video frames. The client handler thread (Listing Ten) carries out a short handshaking protocol with the client to determine whether it wants stereo images, whether it wants full-size or half-size frames, and which compression method should be employed. We've been using a simple lossless key-frame/delta-frame scheme. In this scheme, no special processing is required for stereo frames because the left and right images are stored as one continuous buffer (top/bottom); to the compression code, it just looks like one double-height frame. The 24-bit color images are first packed to 16 bits per pixel in 5-6-5 format. Run-length encoded full frames are sent every second to allow clients to synchronize fully after a maximum delay of one second. The remaining video frames (up to a rate of 25 fps) are sent as run-length encoded differences from the last frame. Key features of the compression routines appear in Listing Eleven (all listings are available electronically).
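To illustrate the compression idea, here is a rough sketch of the two stages: packing 24-bit pixels to 5-6-5, and run-length encoding either a key frame or a difference frame. The function names are ours, and using XOR to express the per-pixel difference is one illustrative choice; the actual routines are in Listing Eleven.

#include <cstdint>
#include <cstddef>
#include <vector>

inline uint16_t Pack565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Convert the double-height 24-bit buffer to 16-bit 5-6-5 pixels.
void ConvertTo565(const uint8_t *rgb24, uint16_t *out, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++)
        out[i] = Pack565(rgb24[3*i], rgb24[3*i + 1], rgb24[3*i + 2]);
}

// Run-length encode the frame itself (key frame) or its XOR with the previous
// frame (delta frame) as (count, value) pairs; in a delta frame, a zero value
// means "pixel unchanged".
std::vector<uint16_t> RleEncode(const uint16_t *frame, const uint16_t *prev,
                                size_t pixels, bool keyFrame)
{
    std::vector<uint16_t> out;
    for (size_t i = 0; i < pixels; ) {
        uint16_t v = keyFrame ? frame[i] : (uint16_t)(frame[i] ^ prev[i]);
        size_t run = 1;
        while (i + run < pixels && run < 0xFFFF) {
            uint16_t w = keyFrame ? frame[i + run]
                                  : (uint16_t)(frame[i + run] ^ prev[i + run]);
            if (w != v) break;
            run++;
        }
        out.push_back((uint16_t)run);
        out.push_back(v);
        i += run;
    }
    return out;
}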
The Client Program
Like the server, the client is a standard Win32 application. The implementation has much in common with the server, so the details can be followed in the comments in the full source code. A couple of points are worth mentioning: a timer is used to send requests to the server for video frames, and the client is configured to allow several requests to be in the pipeline at any one time. (This helps to minimize communication delays. The TCP/IP mechanism handles buffering and ensures that frames arrive in the correct order.)
The organization of the double-height image buffer mirrors that of the server. A data acquisition thread is responsible for filling this buffer by receiving compressed video frames from the server and uncompressing them into the image buffer. The glDrawPixels() function then copies the image data from the double-height buffer into the left and right off-screen buffers.
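Because the buffer uses the same top/bottom arrangement as on the server, the client's draw step reduces to two glDrawPixels() calls, one per eye. The sketch below assumes a quad-buffered stereo context and illustrative parameter names; depending on the capture format, the real code may need GL_BGR_EXT rather than GL_RGB.

#include <windows.h>
#include <GL/gl.h>

void DrawStereoFrame(HDC hdc, const unsigned char *buffer, int width, int height)
{
    long half = (long)width * height * 3;     // bytes in one camera image

    glDrawBuffer(GL_BACK_LEFT);               // left-eye back buffer
    glRasterPos2f(-1.0f, -1.0f);
    glDrawPixels(width, height, GL_RGB, GL_UNSIGNED_BYTE, buffer);

    glDrawBuffer(GL_BACK_RIGHT);              // right-eye back buffer
    glRasterPos2f(-1.0f, -1.0f);
    glDrawPixels(width, height, GL_RGB, GL_UNSIGNED_BYTE, buffer + half);

    SwapBuffers(hdc);                         // presents both eyes together
}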
Tools
We built the programs using both Microsoft's Visual Studio .NET 2003 and Visual Studio 6.0. The DirectX 9.0 SDK is needed. The SDK changes frequently; we used a distribution from 2004, but dxsdk_feb2005 will do the job, too. You'll need the dxsdk_feb2005_extras for DirectShow, and it will be necessary to build the libraries from the DirectShow BaseClasses in the Samples\C++ folder and configure your development environment with the appropriate search paths for headers and libraries. The DirectX SDK is available at http://www.microsoft.com/windows/directx/default.aspx. Our project code is available electronically from DDJ; see "Resource Center," page 5.
Trying It Out
To test our setup, we placed the server on top of a tall building and connected it to one node on our campus computer network. Our VR theater is equipped with a high-end stereo projector, and it was a simple matter to run a client there. Both ends of the pipe were hooked in via 10-Mbps Ethernet networks separated by at least two router hops.
We ran our programs for several hours and used sources with resolutions of 320×240 and 640×480. The lower resolution worked very well at frame rates of up to 25 fps, while the higher resolution was quite capable of achieving 15 fps. Our initial fear that we would not be able to provide suitable synchronization between the left and right images proved groundless; with USB 2.0 cameras running at 15 fps, we achieved quite satisfactory results for all the motion video scenarios we investigated.
Conclusion
With a couple of pieces of simple software, two webcams from the local mart, and almost any off-the-shelf PC, you can build a pretty good experimental stereoscopic web TV. The fact that the necessary hardware costs a fraction of what is being spent on big-budget VR projects might, just might, open up the use of virtual reality in applications not previously considered worth the cost.
DDJ