Matt is a programmer for Nu-Mega Technologies and the author of Windows 95 System Programming Secrets (IDG Books, 1995), from which this article is adapted. He can be contacted at [email protected].
One of the most difficult areas of Windows 95 programming is thunking. Many of the 32-bit system DLLs in Windows 95 rely heavily on 16-bit code. For example, when your 32-bit program calls GetMessage, control ultimately ends up in the GetMessage routine in the 16-bit USER.EXE. Likewise, there are many places where 16-bit system code makes calls to 32-bit system DLLs. For example, when creating a new window, the 16-bit USER.EXE thunks up to KERNEL32.DLL to allocate memory for the window's data from a 32-bit heap. Windows 95 handles both types of thunks (32- to 16-bit, and 16- to 32-bit) seamlessly.
Unfortunately, if you need to call from 32-bit code to 16-bit code (or vice versa) you're supposed to write what are called "flat thunks," which require the use of a special compiler (THUNK.EXE), an assembler, and several hours (or days) of trial and error. Once you get it working correctly, you end up with at least two extra DLLs, one a 16-bit and the other a 32-bit.
Although standard Windows 95 DLLs thunk between 16- and 32-bit code, these system DLLs don't mess around with the thunk compiler or thunking DLLs. Instead, the 32-bit system code uses the undocumented function QT_Thunk. You can use this same routine from your own programs. In fact, if you write thunking DLLs, you'll see that the code that THUNK.EXE emits contains calls to the QT_Thunk function. Be careful, though. Using QT_Thunk is trickier than calling a standard Win32 API function.
In this article, I'll describe what's necessary to call QT_Thunk on your own. I'll also present a sample program that uses QT_Thunk to bypass the thunk compiler. In effect, I'll be thunking in the same manner as the system DLLs. As part of the example, you'll see how a 32-bit program can obtain the free system resources. This is a question I see quite often in online forums, so I can cover two interesting topics at once. However, before jumping into the example code, it's instructive to first look at QT_Thunk and see how it works. In this way, we can get a sense of the Windows 95 architecture. Remember that this function is entirely Windows 95 specific. QT_Thunk (and flat thunks in general) don't exist in Windows NT.
Examining the QT_Thunk function
Listing One is pseudocode I've created for QT_Thunk. QT_Thunk is located in KERNEL32.DLL, and is quite obviously coded in assembler. Rather than presenting the raw assembler code for QT_Thunk, Listing One is a mix of C pseudocode and assembler, which I believe conveys the intent of this fairly complex routine. If you really want to see what goes on, by all means, set a breakpoint on QT_ Thunk in your favorite system debugger (such as SoftIce/W) and step through it. I guarantee that you won't wait long for the breakpoint to be hit.
Looking at the routine from orbit, the job of QT_Thunk is simple: Take the 16:16 address passed to it in the EDX register and transfer control to that address. Of course, nothing is ever that simple, and there are several issues that need to be taken care of. For starters, saving away the address that execution should return to after the 16-bit code finishes would be very helpful. Likewise, it's important to switch the stack from a flat 32-bit stack selector to a 16-bit selector.
Moving in a bit closer to the routine (a "helicopter view" if you will), QT_Thunk is divided into five distinct phases. First, in the debug version, the code calls a routine that logs the call (assuming the right logging flag is set). This section of code also verifies that the Thread Information Block selector is the same as the FS register. If not, the routine complains (in the debug version, that is).
Phase 2 of QT_Thunk pushes the 16:16 address that's the ultimate target of the thunk onto the stack. I'll come back to this in phase 5. Phase 2 also handles the preservation of the return address and the 32-bit register variables. The 32-bit return address that control returns to after the 16-bit code completes is stored in an area of the stack that won't be touched. The register variables that are saved away are ESI, EDI, and EBX. These are the commonly used register variables that Win32 compilers expect will be preserved across routines.
Phase 3 of QT_Thunk relates to acquiring the Win16Mutex. Whenever 32-bit code thunks down to 16-bit code, Windows 95 needs to acquire the Win16Mutex. The Win16Mutex is just a critical section, although it's somewhat unique in that it's used by all processes, both 16- and 32-bit. By forcing all Win32 code that thunks down to 16-bit code to acquire the Win16Mutex, Windows 95 can guarantee that only one thread at a time is executing through the Win16 system DLLs (as well as other 16-bit DLLs). This is how Microsoft got around the unfortunate fact that the 16-bit system DLLs were written without multithreading in mind. The whole subject of the Win16Mutex has been highly controversial. Here I'm simply going to say that the QT_Thunk routine is one of the primary places where Windows 95 acquires the Win16Mutex.
Phase 4 of QT_Thunk is where the routine switches from the flat 32 stack used by the Win32 code to a 16:16 stack for use by the Win16 code. Since Win32 threads typically have 1-MB stacks, and the ESP at the time of the thunk could be anywhere within that 1 MB, you can see that switching to a 16:16 stack could be tricky. It's not sufficient to just allocate a 16-bit stack selector during the thread's start-up and set its base address at that time. Instead, during the thunk to 16 bits, the QT_Thunk routine may need to adjust the base address of the stack selector used by the thread when executing in 16-bit code. The base address of the 16-bit selector is set so that it points to the same general linear-address region that the ESP register was using prior to the thunk. After fiddling with the stack selector as necessary, QT_Thunk figures out an appropriate 16-bit SS:SP combination and loads those values into the SS and SP registers.
The final phase of QT_Thunk transfers control to the 16:16 address that's the target of the thunk. As I showed in phase 2, the 16:16 target address was stored in EDX upon entry to QT_Thunk, and was subsequently pushed on the stack. QT_Thunk jumps to the 16:16 address via the standard RETF trick. But before transferring control to that address, the QT_Thunk code zeros out all the nonessential segment registers (DS, ES, FS, and GS). It wouldn't do to hand the target 16:16 function a DS register set up with a nice, juicy, flat 32-bit selector to scribble with. It's expected that the 16:16 function will set up the segment registers however it needs to.
Note that there is no memory context switching in the QT_Thunk code. At the CPU level, the only changes from one side of the thunk to the other were that the instruction pointer and the stack switched from using 32-bit segments to 16-bit segments. This is significant if you think about the memory arrangement of Windows 95. It implies that all 16-bit DLLs are always mapped into the address space of the current application. Put another way, all 16-bit DLLs reside in memory that is global between all processes. While this sacrifices overall robustness, it is key to keeping Windows 95 flat thunks both small and fast.
Calling QT_Thunk on Your Own
A good example to demonstrate QT_ Thunk is to get the free system resources (FSRs). If you're familiar with Windows 3.x, you might recall that the free system resources value is simply the percentage of free space in the 16-bit USER and GDI local heaps. Believe it or not, Windows 95 provides no way for 32-bit applications to get the FSR value. Even when the standard Windows 95 utilities display the FSR in their About box, the FSR value is retrieved by a call that thunks down to a 16-bit DLL. The 16-bit USER.EXE contains a function called GetFreeSystemResources. If only you could get to it from 32-bit code without writing thunk scripts. Since GetFreeSystemResources takes only one simple parameter, it's a perfect candidate for using QT_Thunk and dispensing with thunk scripts.
Calling QT_Thunk in your code requires you to do two things. First, you have to put the 16:16 address to call into the EDX register. Second, the code from which you're calling QT_Thunk should have an EBP stack frame set up, and at least 0x3C bytes of local storage that you're not relying on. This unusual step is required because QT_Thunk builds its convoluted stack frame for calling the 16-bit code in the region below where the EBP register points to.
The FSR32.C program in Listing Two provides a sample implementation of a call to QT_Thunk. Besides the two calls to QT_Thunk, there are a couple of other things in Listing Two that need describing. For starters, how is FSR32.C getting the address of the 16-bit GetFreeSystemResources function from 32-bit code? I had to cheat a bit and use some additional undocumented KERNEL32.DLL functions. The functions are LoadLibrary16, FreeLibrary16, and GetProcAddress16. These functions are just like their 16-bit counterparts, but can be called directly from 32-bit code.
Since LoadLibrary16, FreeLibrary16, and GetProcAddress16 aren't in Windows NT, you won't find them in the standard KERNEL32.LIB import library. Instead, I got sneaky and created my own import library for them, using the UNDOCK32.DEF file in Listing Three. The prototypes for these three undocumented functions appear at the top of Listing Two. As you might guess, there are more than just three undocumented functions in KERNEL32.DLL. (See my book Windows 95 System Programming Secrets for a complete list of the undocumented functions in KERNEL32.DLL, along with corresponding .DEF and .LIB files.)
To ensure that there are at least 0x3C bytes of unused memory below the EBP frame so that QT_Thunk can set up its peculiar stack frame, the code declares a local array of 0x40 characters that it doesn't use for anything. The QT_Thunk code can bash this memory with impunity. Any variables that are important to FSR32.C are declared as globals, and can't be trashed by QT_Thunk. (I learned this lesson the hard way!) FSR32.C makes the actual call to QT_Thunk using inline assembly code. The reason FSR32.C doesn't make a regular C call to QT_Thunk is because EDX needs to be set up with the 16:16 addresses to call beforehand. You could theoretically just load EDX with one line of inline assembly code before calling QT_Thunk. However, you'd be relying on the compiler to not trash the EDX register before the CALL instruction executes.
Final Thoughts
To build FSR32.EXE, use the included BUILDFSR.BAT file (available electronically; see "Availability," page 3). This batch file first invokes the Microsoft LIB program to turn the UNDOCK32.DEF file into an import library. Afterwards, the batch file calls the command-line compiler (CL.EXE) and passes the file to be compiled. The only additional arguments are the two import libraries, UNDOCK32.LIB and THUNK32.LIB. THUNK32.LIB comes with the Win32 SDK and with Visual C++. This means there is no complicated .MAK file. The resulting program is a console-mode program that you can run directly from the command line. The output is the percentage of free space in both the USER and GDI heaps.
Be advised that the FSR32 code doesn't do anything tricky like passing pointers to 16-bit code. The Win32 API functions that thunk down to 16-bit code and pass pointers to 16-bit DLLs, have additional code for setting up alias 16:16 selectors, and so forth. The main point here is that if you're going to do anything at all tricky, I would suggest using the thunk compiler, which really is the proper way of doing things. The earlier example passes only one parameter, and that parameter doesn't require any translation to be used by the 16-bit code. Examples of parameters that would need to be translated include pointers and window message values. In short, while QT_Thunk can be handy, think carefully before you decide to bypass the thunk compiler and use Windows 95 thunks directly. On the other hand, if you can avoid thunk scripts and creating extra DLLs, go for it!
Listing One
Pseudocode for QT_Thunk // On entry, EDX contains the 16:16 address to transfer control to // // Phase 1: logging and sanity checking // if ( bit 0 not set in FS:[TIBFlags] ) goto someplace else; // Not interested in that here PUSHAD // Save all the registers SomeTraceLoggingFunction( "LS", EDX, 0 ); // EDX is 16:16 target // Make sure that the FS register agrees with the TIB register stored // in the current thread database. if ( (ppCurrentThread->TIBSelector != FS) && (ppCurrentThread != SomeKERNEL32Variable) ) { _DebugOut( SLE_MINORERROR, "32=>16 thunk: thread=%lx, fs=%x, should be %x\n\r", ppCurrentThreadId, FS, ppCurrentThread->TibSelector ); } POPAD // restore all the registers // // Phase 2: saving away the return address and register variable registers // POP DWORD PTR [EBP-24] // Grab return address off the stack // and store it away for later use PUSH DWORD PTR [someVariable] // ??? PUSH EDX // Push 16:16 address on the stack. The RETF // at the end will effectively JMP to it. MOV DWORD PTR [EBP-04],EBX // Save away the common MOV DWORD PTR [EBP-08],ESI // compiler register variables MOV DWORD PTR [EBP-0C],EDI // // Phase 3: Acquiring the Win16Mutex // PUSHAD, PUSHFD // Save all registers _CheckSysLevel( pWin16Mutex ) POPFD, POPAD // restore all registers FS:[Win16MutexCount]++; // If we don't have the mutex already, if ( FS:[Win16MutexCount] == 0 )// grab it now. GrabMutex( pWin16Mutex ); PUSHAD, PUSHFD // Save all registers _CheckSysLevel( pWin16Mutex ) POPFD, POPAD // restore all registers // // Phase 4: Saving off the old SS:ESP and switching to the 16:16 stack // Calculate the 16:16 stack pointer MOV DX,WORD PTR [EDI->currentSS] // Load DX with 16 bit SS MOV DI,SS // Save away the flat SS value into DI // (The callee is expected to preserve it) MOV SS,DX // Load SS:(E)SP with the 16 bit stack ptr MOV ESP,ESI SUB EBP,EBX // Adjust EBP for the thunk MOV SI,FS // Save away FS (TIB ptr) register into SI // (The callee is expected to preserve it) // // Phase 5: Jumping to the 16:16 bit code // GS = FS = ES = DS = 0; // Zero out the segment registers RETF // Effectively does a JMP 16:16 to the address // passed in the EDX register
Listing Two
//================================== // FSR32 - Matt Pietrek 1995 // FILE: FSR32.C //================================== #define WIN32_LEAN_AND_MEAN #include <windows.h> #include <stdio.h> #pragma hdrstop typedef int (CALLBACK *GFSR_PROC)(int); // Steal some #define's from the 16 bit WINDOWS.H #define GFSR_GDIRESOURCES 0x0001 #define GFSR_USERRESOURCES 0x0002 // Prototype some undocumented KERNEL32 functions HINSTANCE WINAPI LoadLibrary16( PSTR ); void WINAPI FreeLibrary16( HINSTANCE ); FARPROC WINAPI GetProcAddress16( HINSTANCE, PSTR ); void __cdecl QT_Thunk(void); GFSR_PROC pfnFreeSystemResources = 0; // We don't want these as locals in HINSTANCE hInstUser16; // main(), since QT_THUNK could WORD user_fsr, gdi_fsr; // trash them... int main() { char buffer[0x40]; buffer[0] = 0; // Make sure to use the local variable so that the // compiler sets up an EBP frame hInstUser16 = LoadLibrary16("USER.EXE"); if ( hInstUser16 < (HINSTANCE)32 ) { printf( "LoadLibrary16() failed!\n" ); return 1; } FreeLibrary16( hInstUser16 ); // Decrement the reference count pfnFreeSystemResources = (GFSR_PROC) GetProcAddress16(hInstUser16, "GetFreeSystemResources"); if ( !pfnFreeSystemResources ) { printf( "GetProcAddress16() failed!\n" ); return 1; } __asm { push GFSR_USERRESOURCES mov edx, [pfnFreeSystemResources] call QT_Thunk mov [user_fsr], ax push GFSR_GDIRESOURCES mov edx, [pfnFreeSystemResources] call QT_Thunk mov [gdi_fsr], ax } printf( "USER FSR: %u%% GDI FSR: %u%%\n", user_fsr, gdi_fsr ); return 0; }
Listing Three
LIBRARY KERNEL32 EXPORTS LoadLibrary16@4 @35 FreeLibrary16@4 @36 GetProcAddress16@8 @37