Important Note


An updated version with a built-in disassembler is now available.


Mhook, an API hooking library


Mhook is a library for installing API hooks. If you dabble in this area then you’ll already know that Microsoft Research's Detours pretty much sets the benchmark when it comes to API hooking. Why don't we get a comparison out of the way quickly then?

Detours vs. Mhook


Detours is available for free with a noncommercial license but it only supports the x86 platform. Detours can also be licensed for commercial use which also gives you full x64 support, but you only get to see the licensing conditions after signing an NDA.

Mhook is freely distributed under an MIT license with support for x86 and x64.

Detours shies away from officially supporting the attachment of hooks to a running application. Of course, you are free to do it - but if you end up causing a random crash here or there, you can only blame yourself.

Mhook was meant to be able to set and remove hooks in running applications – after all, that’s what you need it for in the real world. It does its best to avoid overwriting code that might be under execution by another thread.

Detours supports transactional hooking and unhooking; that is, setting a bunch of hooks at the same time with an all-or-nothing approach. Hooks will only be set if all of them can be set, otherwise the library will roll back any changes made. Mhook does not do this.

Detours has a built-in x86 (and, when paid for, x64) disassembler so it can automatically hook an API. This is the fundamental difference between Detours and Mhook, and probably the only one that really needs improvement: Mhook has no disassembler so the user must first, by hand, examine the first few bytes of the target API and make the resulting information available to Mhook. This also means that Mhook will not function on an OS where the disassembly of the target function’s first few bytes is different from what has been anticipated. It is possible to give Mhook information on several possible disassemblies at once, thereby supporting multiple operating systems, but this is a bit inconvenient. On the other hand, the lack of a disassembler allows the library to remain very lightweight.

Finally, Mhook is pretty wasteful when it comes to allocating memory for the trampolines it uses. Detours allocates blocks of memory as needed, and uses the resulting data area to store as many trampolines within as will fit. Mhook, on the other hand, uses one call to VirtualAlloc per hook being set. Every hook needs less than 100 bytes of storage so this is very wasteful, since VirtualAlloc ends up grabbing 64K from the process' virtual address space every time Mhook calls it. (Actual allocated memory will be a single page which is also quite wasteful.) In the end though, this probably does not really matter, unless you are setting a very large number of hooks in an application. Also, this is very easy to fix.
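To put rough numbers on the waste, here is a small arithmetic sketch. The 64K granularity and the 100-byte upper bound come from the paragraph above; the 4K page size is the usual value on x86/x64 Windows and is an assumption here, as are the function names.

```c
#include <assert.h>
#include <stddef.h>

/* Figures from the text, plus an assumed 4K page size. */
enum {
    ALLOC_GRANULARITY = 64 * 1024, /* address space consumed per VirtualAlloc call */
    PAGE_BYTES        = 4 * 1024,  /* memory actually committed per call (assumed) */
    HOOK_BYTES        = 100        /* upper bound on storage needed per hook */
};

/* Address space used when every hook gets its own allocation. */
size_t address_space_used(size_t hooks)
{
    return hooks * ALLOC_GRANULARITY;
}

/* Number of hooks that would fit if one block of block_bytes were shared. */
size_t hooks_per_block(size_t block_bytes, size_t hook_bytes)
{
    assert(hook_bytes > 0);
    return block_bytes / hook_bytes;
}
```

With these figures, a single shared page would hold 40 hooks, while 1000 individually allocated hooks take roughly 62 MB of address space.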

With that out of the way, if you’re still here, let’s delve into it.

Example: Hooking NtOpenProcess on x86 and x64


It might be best to start off with a short code example that shows the library in action. The snippet below hooks NtOpenProcess on both x86 and x64 Vista.

//=========================================================================
#include "stdafx.h"
#include "mhook.h"

//=========================================================================
// Define _NtOpenProcess so we can dynamically bind to the function
//
typedef ULONG (WINAPI* _NtOpenProcess)(OUT PHANDLE ProcessHandle, 
	IN ACCESS_MASK AccessMask, IN PVOID ObjectAttributes, 
	IN PCLIENT_ID ClientId ); 


//=========================================================================
// Get the current (original) address to the function to be hooked
//
_NtOpenProcess TrueNtOpenProcess = (_NtOpenProcess)
	GetProcAddress(GetModuleHandleW(L"ntdll"), "NtOpenProcess");

//=========================================================================
// This is the function that will replace NtOpenProcess once the hook 
// is in place
//
ULONG WINAPI MyNtOpenProcess(OUT PHANDLE ProcessHandle, 
			  IN ACCESS_MASK AccessMask, 
			  IN PVOID ObjectAttributes, 
			  IN PCLIENT_ID ClientId )
{
	// do any processing here if needed
	// ...
	// punt the call to the OS in the end
	return TrueNtOpenProcess(ProcessHandle, AccessMask, 
		ObjectAttributes, ClientId);

}

//=========================================================================
// The first few instructions at ntdll!NtOpenProcess
// These are the bytes that we're going to overwrite.
//
BYTE OriginalNtOpenProcess[] = {

#ifdef _M_IX86
	0xb8, 0xc2, 0x00, 0x00, 0x00,	// mov     eax, 0C2h 
	0xba, 0x00, 0x03, 0xfe, 0x7f,	// mov     edx, 7FFE0300h
#elif defined _M_X64
	0x4c, 0x8b, 0xd1,			// mov     r10, rcx
	0xb8, 0x23, 0x00, 0x00, 0x00,	// mov     eax, 23h
#else
	#error unsupported platform
#endif
};


//=========================================================================
// This is how you go about putting the hook in place. When you're done,
// any calls to NtOpenProcess will be redirected to MyNtOpenProcess.
//
// If you need to access the original, unmodified API, call 
// TrueNtOpenProcess from your code, just like the hook function above does.
//
BOOL WINAPI SetHooksAndDoWork () {

    BOOL bHook = Mhook_SetHook((PVOID*)&TrueNtOpenProcess, 
			MyNtOpenProcess, 
			OriginalNtOpenProcess, 
			sizeof(OriginalNtOpenProcess));

    // Minimalist error handling
    if (!bHook) return FALSE;

    // ... any calls to NtOpenProcess within this process are now
    // rerouted to MyNtOpenProcess.

    // For example, this call will end up in our hook function since
    // kernel32!OpenProcess just calls ntdll!NtOpenProcess internally
    HANDLE hProc = OpenProcess(PROCESS_ALL_ACCESS, FALSE,
        GetCurrentProcessId());
	
    // ...

    // This call will bypass the hook:
    // (parameter initialization omitted for brevity)
    // Note that the access mask is passed by value, not by pointer.
    TrueNtOpenProcess(&hProc, accessMask, &objAttrs, &clientId);

    // ...

    // Remove the hook when we're done.
    return Mhook_Unhook((PVOID*)&TrueNtOpenProcess);

}

//=========================================================================

Overview


As you can see the library is pretty easy to use. It gets only slightly more complicated under the hood.

When hooking a Windows API function you determine the location of the function, change the page protection so the memory can be written to, modify the function so it jumps to your own code rather than doing its own thing (other processes that have this DLL loaded will be unaffected due to the OS’s copy-on-write mechanism) – and you’re done.

Normally you’d do this from a DLL that you inject into the target process via any of the well-established methods: the AppInit_DLLs registry value, CreateRemoteThread, or even SetWindowsHookEx. When the DLL is loaded, you set your API hook in DLL_PROCESS_ATTACH, and remove it in DLL_PROCESS_DETACH.

Of course you will need to do a bit of housekeeping, the most important aspect of which is ensuring that the original API will be available if and when you need it: for one, your replacement function will probably have to be able to access the original API.

So how exactly do we tie up the loose ends?

When a hook is set, the library allocates a chunk of memory that will contain the trampolines. (No, this isn’t a term Microsoft came up with: see http://en.wikipedia.org/wiki/Trampoline_(computers).) The trampolines will need to be within +/- 2GB of the target API because the hooked version of the function will begin with a 32-bit relative jump once the hook is in place. A 32-bit relative jump takes up 5 bytes on both x86 and x64. This is the smallest JMP instruction useful for our purposes, and it’s important to keep these JMPs small because space is sometimes at a premium in target APIs.
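The 5-byte relative JMP can be sketched as follows. This is a portable illustration rather than Mhook's own code: addresses are passed as plain 64-bit integers, and the function name is made up for the example. The displacement is measured from the end of the JMP instruction, which is why the +/- 2GB limit exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REL_JMP_SIZE 5 /* E9 opcode + 32-bit displacement */

/*
 * Encodes "JMP rel32" into out[0..4] as if the instruction lived at address
 * `from` and should land on address `to`. Returns 1 on success, 0 when the
 * distance does not fit in a signed 32-bit displacement.
 */
int emit_rel_jmp(uint8_t *out, uint64_t from, uint64_t to)
{
    assert(out != NULL);

    /* The displacement is relative to the instruction that follows the JMP. */
    int64_t diff = (int64_t)(to - (from + REL_JMP_SIZE));
    if (diff < INT32_MIN || diff > INT32_MAX)
        return 0;

    uint32_t disp = (uint32_t)(int32_t)diff;
    out[0] = 0xE9;
    for (int i = 0; i < 4; i++)              /* little-endian byte order */
        out[1 + i] = (uint8_t)(disp >> (8 * i));
    return 1;
}
```

A jump from 0x2000 back to 0x1000 encodes as E9 FB EF FF FF; a jump from address 0 to 0x100000000 is rejected because the target is more than 2GB away.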

Fortunately the allocation process is easy, thanks to VirtualQuery and VirtualAlloc. You start enumerating memory blocks with VirtualQuery at (TargetFunctionAddr – 2GB), and loop until you find a free block. Once a free block has been found, you try to allocate some memory for your trampolines. If you succeed you’re done with this step – if you fail you simply keep trying until a block is successfully allocated, or your base address goes beyond (TargetFunctionAddr + 2GB), at which point you must give up.
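The search can be modelled without any Windows calls. In the sketch below the `Region` struct is a hypothetical stand-in for the handful of MEMORY_BASIC_INFORMATION fields that matter, and the region list plays the role of successive VirtualQuery results. It only finds a candidate address; the real code would then attempt the allocation and carry on scanning if that fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TWO_GB 0x80000000ull

/* Hypothetical stand-in for one VirtualQuery result. */
typedef struct {
    uint64_t base;    /* start address of the region */
    uint64_t size;    /* region size in bytes */
    int      is_free; /* nonzero if nothing is mapped there */
} Region;

/*
 * Walks an address-ordered region list and reports the first free spot
 * within +/- 2GB of `target` that has room for `needed` bytes.
 * Returns 1 and stores the address in *out, or 0 if nothing qualifies.
 */
int find_trampoline_base(const Region *map, size_t count,
                         uint64_t target, uint64_t needed, uint64_t *out)
{
    assert(out != NULL);
    uint64_t lo = target > TWO_GB ? target - TWO_GB : 0;
    uint64_t hi = target > UINT64_MAX - TWO_GB ? UINT64_MAX : target + TWO_GB;

    for (size_t i = 0; i < count; i++) {
        uint64_t start = map[i].base;
        uint64_t end   = start + map[i].size;
        if (start >= hi)
            break;                       /* past target + 2GB: give up */
        if (!map[i].is_free || end <= lo)
            continue;                    /* in use, or below the window */
        uint64_t candidate = start > lo ? start : lo;
        uint64_t limit     = end < hi ? end : hi;
        if (limit - candidate >= needed) {
            *out = candidate;
            return 1;
        }
    }
    return 0;
}
```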

Once we have a suitable chunk of memory, we create two trampolines in it. The first one will contain the first few machine-code instructions from the target API (these will be overwritten by the JMP instruction in the original location so we must duplicate them somewhere) followed by a jump to the rest of the API back in its original location. Calling this trampoline is functionally identical to calling the original API before we hooked it – the address of the trampoline can therefore be used as a secret entry point for the original, unmodified functionality of the API. The other trampoline is just a crutch: the JMP instruction at the beginning of the hooked API will point here. The crutch will contain another JMP instruction that points at our replacement function. The reason for not JMPing directly from the original API to the replacement function is that we cannot be certain that the replacement function (which will most likely just reside in our DLL as a C/C++ routine) is within +/- 2GB of the original function – but the crutch is guaranteed to be there. We therefore insert a 5-byte version of the JMP instruction in the original function and have room for another JMP in the trampoline, thus eliminating the problem of our replacement code potentially being further than 2GB from the target API.
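The first trampoline (saved instructions followed by a jump back) might be assembled like this. It is a sketch under a few assumptions: the buffer is treated as if it were mapped at `tramp_addr`, the two addresses are already known to be within 2GB of each other, and the 32-byte budget is the figure this article gives for the trampoline. The function name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REL_JMP_SIZE    5   /* E9 opcode + rel32 */
#define TRAMPOLINE_SIZE 32  /* per-trampoline budget quoted in the article */

/*
 * Lays out: [n saved prologue bytes][JMP rel32 -> target_addr + n]
 * Returns the number of bytes written to `tramp`.
 */
size_t build_trampoline(uint8_t *tramp, uint64_t tramp_addr,
                        const uint8_t *prologue, size_t n,
                        uint64_t target_addr)
{
    assert(n + REL_JMP_SIZE <= TRAMPOLINE_SIZE);

    /* 1. Duplicate the instructions that the hook is about to overwrite. */
    memcpy(tramp, prologue, n);

    /* 2. Jump to the first instruction that was not copied. */
    uint64_t after_jmp = tramp_addr + n + REL_JMP_SIZE;
    uint32_t disp = (uint32_t)((target_addr + n) - after_jmp);
    tramp[n] = 0xE9;
    for (int i = 0; i < 4; i++)
        tramp[n + 1 + i] = (uint8_t)(disp >> (8 * i));

    return n + REL_JMP_SIZE;
}
```

Calling the buffer at `tramp_addr` then behaves like calling the unhooked function, which is exactly what the `True...` function pointer ends up holding.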

An absolute JMP on x86 requires 10 bytes: two for the FF 25 opcode and ModRM byte, four for the absolute address of the memory location that stores the target address, and four for the aforementioned memory location with the target address itself. The same instruction requires 14 bytes on x64: two bytes for FF 25, four for a relative offset pointing to the memory location of the absolute address, and 8 bytes at the aforementioned location that point at the target location itself. You could do this with shorter instructions if you didn’t care about preserving register contents, but that’s not an option in our case.
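A sketch of the indirect form used in the crutch, assuming the FF 25 encoding (opcode plus ModRM byte) with the pointer slot placed directly after the instruction. With that layout the sequence is 10 bytes on x86 and 14 bytes on x64. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x64: FF 25 00 00 00 00 <8-byte target>; the zero displacement is
 * RIP-relative, so it refers to the 8 bytes right after the instruction. */
size_t emit_abs_jmp_x64(uint8_t *out, uint64_t target)
{
    assert(out != NULL);
    out[0] = 0xFF;
    out[1] = 0x25;
    for (int i = 0; i < 4; i++)
        out[2 + i] = 0;
    for (int i = 0; i < 8; i++)
        out[6 + i] = (uint8_t)(target >> (8 * i));
    return 14;
}

/* x86: FF 25 <absolute address of slot> <4-byte target>; the slot sits
 * right after the 6-byte instruction, i.e. at code_addr + 6. */
size_t emit_abs_jmp_x86(uint8_t *out, uint32_t code_addr, uint32_t target)
{
    assert(out != NULL);
    uint32_t slot = code_addr + 6;
    out[0] = 0xFF;
    out[1] = 0x25;
    for (int i = 0; i < 4; i++) {
        out[2 + i] = (uint8_t)(slot >> (8 * i));
        out[6 + i] = (uint8_t)(target >> (8 * i));
    }
    return 10;
}
```

Neither form touches a register, which is the property the crutch needs.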

Putting it together


Once the trampolines have been set up with the proper instructions, we simply overwrite the target function’s first five bytes with a JMP to our replacement function (well, a JMP to the trampoline that jumps to the replacement function, to be precise) and we’re done: the API is hooked. This step requires extra care, however. Aside from using VirtualProtect to make the target page writable (PAGE_EXECUTE_READWRITE) and then restoring the original page protection, we must also call FlushInstructionCache after we’re done. To protect other threads in the application and ensure that we’re not overwriting a memory location where another thread is currently executing, we also suspend all other threads before placing our JMP into the target function, and resume them afterwards. We go even further and compare each suspended thread’s instruction pointer against the target API. If the IP is in the range where the JMP instruction will be placed (basically, the first 5 bytes of the API), we resume the thread, let the IP advance, and suspend the thread again.
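The instruction-pointer test itself is a simple range check. The sketch below shows only that predicate; the function name is illustrative, and in a real implementation the `ip` value would come from the suspended thread's context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns nonzero when a suspended thread's instruction pointer lies inside
 * the `len` bytes at `target` that are about to be replaced by the JMP.
 */
int ip_in_overwrite_zone(uint64_t ip, uint64_t target, size_t len)
{
    assert(len > 0);
    return ip >= target && ip - target < len;
}
```

A thread for which this returns nonzero is resumed briefly and checked again, as described above, until its IP has left the zone.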

The following diagram is meant to illustrate a hooked function, the replacement function, the trampolines, and their relationship (the arrows indicate transfer of control):


[Figure: Mhook in action]

The Disassembly


Part of the preparation for hooking an API is examining the target function’s disassembly and determining the first few instructions that will have to be overwritten. As mentioned earlier, this is required because Mhook lacks a built-in disassembler. I might provide automatic disassembly in a future version of the library, but until then, you have to identify the first few instructions of the target API manually and let Mhook know about them.

You will need to look at the disassembly of the target function and determine how long the overwrite region is going to be. The rules are simple:

  • We need at least five bytes.
  • It can be a bit more to allow for instruction boundaries: The library allocates 32 bytes for the trampoline but this must also contain the JMP to the rest of the hooked API.
  • None of the instructions in the region can be direct or indirect control transfer, such as JMP, Jcc, CALL, RET, etc.

The bytes you pick here will be copied to the trampoline, and their original location will be replaced by a JMP to the hook function (or, more precisely, a JMP to the crutch that jumps to the hook function). The bytes just copied to the trampoline will be followed by a JMP to the instruction beginning with the first byte that was not copied.

For example, on x86 XP SP2, kernel32!GetModuleFileNameW starts with something like this:

7C80B3D5  6a 28            push        0x28
7C80B3D7  68 30 b4 80 7c   push        0x7C80B430
7C80B3DC  e8 e5 70 ff ff   call        [eip+ilen-0x8F1B]

We’re not supposed to overwrite the third instruction (a CALL) but the two instructions preceding it give us 7 bytes of buffer space which is sufficient so we just copy those.

The generated trampoline code will contain the two PUSH instructions, followed by a JMP to the CALL at 7C80B3DC. The library will overwrite the PUSH instructions at 7C80B3D5 and 7C80B3D7 with a 5-byte JMP to the crutch, which in turn will jump to the replacement function.
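The manual procedure amounts to adding up whole instructions until at least five bytes are covered, and stopping if a control transfer gets in the way. A sketch, assuming you have already read the instruction lengths off a disassembly (the library does not compute them for you); the function name and the flag array are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_OVERWRITE 5 /* room for the 5-byte relative JMP */

/*
 * lengths[i]     : byte length of the i-th instruction of the function
 * is_transfer[i] : nonzero if that instruction is a JMP, Jcc, CALL, RET, ...
 * Returns the number of bytes to hand to the library, or 0 if no safe
 * overwrite region can be formed from the instructions given.
 */
size_t overwrite_length(const unsigned char *lengths,
                        const unsigned char *is_transfer, size_t count)
{
    assert(lengths != NULL && is_transfer != NULL);
    size_t total = 0;
    for (size_t i = 0; i < count && total < MIN_OVERWRITE; i++) {
        if (is_transfer[i])
            return 0;        /* a control transfer falls inside the zone */
        total += lengths[i];
    }
    return total >= MIN_OVERWRITE ? total : 0;
}
```

For the GetModuleFileNameW listing above the lengths are 2, 5 and 5 with only the last instruction flagged, which yields 7: the two PUSH instructions.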

Control transfer is not allowed in the overwrite region. Typically these instructions don’t appear in the first few bytes of any exported Windows API anyway so you should be in the clear.

CALL, RET, INT, SYSCALL, SYSENTER, JMP, Jcc (such as JNC, JE, JGE), etc. are dangerous, but it is worth noting that some of these control transfer instructions might be OK to overwrite. For example, an indirect JMP encoded as FF 25 could be overwritten and copied elsewhere on x86, but not on x64. (Its 32-bit operand is an absolute address on x86 but is RIP-relative on x64.) A JMP with opcode E9 or EB cannot be overwritten and copied on either architecture as it is relative on both platforms. Conditional control transfers are always relative, therefore they cannot be copied to the trampoline for execution. We could overwrite CALL, INT and any other instruction that causes the caller to resume execution, sooner or later, immediately after the instruction in question. However, we’d need to be very careful with the instruction immediately following one of these. Our thread suspension code ensures that the IP is not in the overwrite zone but it does not walk the stack to ensure that the IP will not eventually point in the overwrite zone as a subroutine or interrupt returns. And even if we did walk the stack to detect if a thread is in danger of returning to the wrong location, there’s nothing we could do to reliably nudge a potentially problematic thread out of a wait state if it is in one.
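To see why a relative transfer cannot simply be copied, consider where it lands before and after the move. The sketch below is plain address arithmetic; the function names are illustrative, and the second helper shows the fix-up a relocating library would have to perform, which Mhook does not do.

```c
#include <assert.h>
#include <stdint.h>

/* Destination of a relative jump: next instruction address + displacement. */
uint64_t rel_jmp_destination(uint64_t insn_addr, unsigned insn_len, int32_t disp)
{
    assert(insn_len > 0);
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}

/* Displacement the same jump would need at new_addr to keep its destination. */
int64_t relocated_disp(uint64_t old_addr, uint64_t new_addr,
                       unsigned insn_len, int32_t disp)
{
    uint64_t dest = rel_jmp_destination(old_addr, insn_len, disp);
    return (int64_t)(dest - (new_addr + insn_len));
}
```

A 2-byte short jump with displacement 0x10 goes to 0x1012 when it sits at 0x1000 but to 0x5012 when copied verbatim to 0x5000. Keeping the original destination would need a displacement of -0x3FF0, which no longer fits in the single displacement byte of that encoding.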

In order to keep things simple, avoid including control transfer instructions in the overwrite zone. Chances are you will not meet them in the first 5 bytes anyway.

Using the library


Mhook is quite simple to use, apart from the disassembly caveats described above. There are two functions you’ll need to call:

BOOL Mhook_SetHook(PVOID *ppSystemFunction, 
                   PVOID pHookFunction, 
                   PBYTE pStoredSystemFunction, 
                   DWORD dwInstructionLength);

    Mhook_SetHook will set a hook in a specified API.

    PVOID* ppSystemFunction

    The first parameter to Mhook_SetHook is used for both input and output: it needs to point to a variable that stores the address of the function to be hooked. Upon successful return of the function this variable will contain the address of the trampoline that provides access to the original functionality of the hooked API.

    PVOID pHookFunction

    The second parameter to Mhook_SetHook is a pointer to the replacement function.

    PBYTE pStoredSystemFunction

    The bytes that you identified as safe to relocate and overwrite at the start of the original API.

    DWORD dwInstructionLength

    The number of bytes pointed to by the previous parameter.

    Return value: nonzero if successful, zero if an error occurs.

BOOL Mhook_Unhook(PVOID *ppHookedFunction);

    Mhook_Unhook will remove a previously set hook.

    PVOID* ppHookedFunction

    The one and only parameter to Mhook_Unhook is used for both input and output. It needs to point to the same location the first parameter of Mhook_SetHook pointed to. When the function returns with success, it will also overwrite the contents of the memory location addressed by this parameter, and instead of pointing at the trampoline, the memory will once again point to the original API.

    Return value: nonzero if successful, zero if an error occurs.

Take a look at the sample code at the beginning of this article to see these calls in action.

Summary


If you don’t mind the limitations brought on by the manual disassembly process then Mhook can be a very useful tool.

Mhook is distributed under the MIT license. As usual, no warranties are implied or expressly granted.

You can download the source code from here: mhook-1.0.zip


Copyright (c) 2007, Marton Anka

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.



Future Improvements


Mhook is far from perfect. The following things should be addressed in the future:

  • Add a disassembler so no manual machine code analysis is required
  • Implement a memory allocator so one call to VirtualAlloc can service multiple hooks
  • Improve the thread-suspension code so it can deal with threads that are spawned during the execution of the thread-suspension process itself
  • Improve error handling so meaningful failure codes can be retrieved by GetLastError
  • For the truly paranoid: deal with possible conflicts with other hooking libraries (what if Mhook_SetHook is called on a function that is currently hooked with Detours, etc)
  • Add support for IA64 (Itanium)
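As an illustration of the allocator item in the list above, one block could be carved into fixed-size slots so that a single VirtualAlloc call serves many hooks. This is only a sketch of the idea and not part of the library: the slot size, block size, and all names are hypothetical, and the block is passed in by the caller so the example stays independent of the Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_SIZE  128u  /* hypothetical; above the <100 bytes a hook needs */
#define BLOCK_SIZE 4096u /* one page, assumed */
#define SLOT_COUNT (BLOCK_SIZE / SLOT_SIZE)

typedef struct {
    uint8_t *block;              /* memory obtained from one allocation */
    uint8_t  in_use[SLOT_COUNT]; /* one flag per slot */
} HookPool;

void pool_init(HookPool *pool, uint8_t *block)
{
    assert(pool != NULL && block != NULL);
    pool->block = block;
    for (size_t i = 0; i < SLOT_COUNT; i++)
        pool->in_use[i] = 0;
}

/* Returns a free slot, or NULL when the block is exhausted. */
uint8_t *pool_alloc(HookPool *pool)
{
    for (size_t i = 0; i < SLOT_COUNT; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            return pool->block + i * SLOT_SIZE;
        }
    }
    return NULL;
}

/* Marks the slot that starts at `slot` as free again. */
void pool_free(HookPool *pool, uint8_t *slot)
{
    size_t i = (size_t)(slot - pool->block) / SLOT_SIZE;
    if (i < SLOT_COUNT)
        pool->in_use[i] = 0;
}
```

With these sizes one page serves 32 hooks. A real version would also need a pool per 2GB neighbourhood, since every trampoline still has to sit within reach of its target.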