Most of the techniques we see used in the wild leave enough artefacts that they can be detected reliably from user-mode. However, sometimes we come across a technique which is best analysed from kernel mode. One example is the “Gargoyle” code-scanner evasion.
Briefly, Gargoyle works by placing malicious code inside a non-executable area of memory. It then creates a system timer, configuring it to execute a ROP chain on expiry. The ROP chain calls VirtualProtectEx, marking the malicious code as executable, calls the malicious code, and then calls VirtualProtectEx a second time – this time, marking the malicious code as non-executable. The timer is then reinitialised and the cycle starts anew. There’s a longer example (complete with graphics) and a PoC at the author’s GitHub account.
It’s a really neat method, because using a ROP chain means there is no stub code to call VirtualProtect or suchlike. Since most investigators won’t look for code in non-executable areas of memory, there’s nothing to find. In fact, it’s something our very own MWR consultants have blogged about before with regard to Cobalt Strike integration.
To detect the attack, we decided to audit system timers and locate those that look suspicious. This means diving into some undocumented kernel code, so we fire up WinDbg and build the PoC. For the purposes of demonstration, we’re going to use a 32-bit version of Windows. Everything here also applies to the 64-bit world (and also to WoW64), but there are some extra steps which obscure the simplicity of what’s actually going on.
If you’re following along at home, any 32-bit Windows 10 VM should work. For reference, our screenshots are taken from Win 10 build 17134.191.
The first thing of interest is that WinDbg provides a method to enumerate system timers – the “!Timer” command. This saves us a lot of effort, revealing a lot of timers which the kernel is currently managing:
Each of these timers is of type KTIMER (see the Microsoft documentation). A common practice for MS is to extend documented structures with a larger undocumented structure, and that’s exactly what’s being done here - the KTIMER is actually a subset of the larger (and undocumented) ETIMER. Let’s examine one of the timers, interpreting it as an ETIMER:
As an aside, we can see the published KTIMER as the first element of the ETIMER struct. The most useful element for us, however, is ‘TimerApc’. As you may have guessed, this is the APC that will be queued once the timer fires. We can examine it (just click the ‘TimerApc’ hyperlink):
If we look in the relevant documentation, we see that the ‘NormalRoutine’ member is what will be executed once the APC is queued. Great! So we can go ahead and examine the handler:
Oh no! There’s no memory at that address! What’s going on here?! Why can’t we access the stack pivot? How can the kernel execute the payload once the timer fires? Well, it turns out that this is a good example of a kernel mechanism that is invisible from user-space, but that is important to be aware of when working in kernel-space.
As you may know, 32-bit Windows usually splits the 4GB virtual address space in half, using the lower 2GB for user memory and the upper 2GB for kernel memory. The kernel half stays mapped at all times, whichever process is running, but the user half does not.
Each process gets its own version of this 2GB of virtual address space. When a process is scheduled on a processor, that processor is configured to use the correct 2GB of virtual address space. This is the mechanism which allows, for example, notepad and minesweeper to map a different module at the same virtual address independently. If you’re interested in learning more on the subject, a good starting point may be https://www.triplefault.io/2017/08/exploring-windows-virtual-memory.html and the indispensable “Windows Internals” book.
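To make this concrete, here is a toy model of the idea. It is only a sketch: the class and the kernel address are invented for illustration, and real translation is done by the processor's page tables rather than by dictionaries. Each process has a private view of the user half, while the kernel half is shared by all of them.

```python
# Toy model of per-process address spaces on 32-bit Windows (default 2GB/2GB split).
# All names are illustrative; this is not how the kernel implements paging.

KERNEL_BASE = 0x80000000  # first address of the kernel half


class ToyProcess:
    kernel_pages = {}  # shared: the same kernel mapping is visible from every process

    def __init__(self, name):
        self.name = name
        self.user_pages = {}  # private: the user half belongs to this process only

    def read(self, address):
        """Return what is mapped at `address` in this process's context, or None."""
        if address >= KERNEL_BASE:
            return ToyProcess.kernel_pages.get(address)
        return self.user_pages.get(address)


system = ToyProcess("System")
gargoyle = ToyProcess("Gargoyle.exe")

# A user-mode address is only meaningful inside the process that mapped it.
gargoyle.user_pages[0x00f30000] = "gargoyle payload"
# Hypothetical kernel address, visible whichever process is current.
ToyProcess.kernel_pages[0x81000000] = "kernel data"

print(system.read(0x00f30000))    # None
print(gargoyle.read(0x00f30000))  # gargoyle payload
print(system.read(0x81000000) == gargoyle.read(0x81000000))  # True
```

Reading a user-mode address while the wrong process is current behaves like the first `print`: the address simply is not there.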
Anyway, to get back to the topic in hand: a different process is currently running, so when we examine memory we actually see a different process’s memory map. This makes sense – during our analysis, the gargoyle.exe process is dormant, waiting on a timer before it comes out of hiding.
We can confirm this hypothesis by checking what process is currently running via WinDbg:
This confirms it – the ‘System’ process is currently running (see the ‘image’ field). Fortunately, WinDbg has functionality to manipulate and select process contexts, which we will use now to observe the gargoyle handler.
You may have noticed earlier that the APC itself has a ‘Thread’ field. This, unsurprisingly, is the thread which the APC will be queued to once the timer fires. We can use it to locate the correct process to switch to via the “!Thread” command:
Note the image – “Gargoyle.exe”. This is a thread from our Gargoyle image. Also shown is the ‘owning process’, which is the address of an EPROCESS structure. We can instruct WinDbg to use the memory ranges allocated to this owning process, and then we’ll be able to see the completion handler as expected:
That’s the stack pivot used by Gargoyle. We can also observe the minimal ROP chain by examining the parameter passed to the timer function. This parameter is held in the NormalContext field of the APC, and we follow its second dword, which the stack pivot transfers into ESP:
Here, we can see a call to VirtualProtectEx(-1, 0x00f30000, 0x00001000, 0x00000020, 0x00f70054), which will return to 0x00f30000 (at the top of the stack!), which is where the gargoyle code itself lives.
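The same decoding can be done programmatically. The sketch below assumes the usual 32-bit stdcall layout (function address, return address, then the five arguments) and uses a made-up address for VirtualProtectEx; the remaining values are the ones quoted above. The protection constants are the standard Win32 values.

```python
import struct

# Subset of the Win32 memory-protection constants.
PAGE_PROTECTIONS = {
    0x01: "PAGE_NOACCESS",
    0x02: "PAGE_READONLY",
    0x04: "PAGE_READWRITE",
    0x10: "PAGE_EXECUTE",
    0x20: "PAGE_EXECUTE_READ",
    0x40: "PAGE_EXECUTE_READWRITE",
}


def decode_virtualprotectex_frame(stack_bytes):
    """Decode seven little-endian dwords: function, return address, five arguments."""
    fields = struct.unpack("<7I", stack_bytes[:28])
    func, ret, h_process, address, size, new_protect, old_protect_ptr = fields
    return {
        "function": func,
        "return_address": ret,
        "process_handle": h_process,  # 0xffffffff is -1, the current-process pseudo-handle
        "address": address,
        "size": size,
        "new_protect": PAGE_PROTECTIONS.get(new_protect, hex(new_protect)),
        "old_protect_ptr": old_protect_ptr,
        # Does the call return straight into the region it has just re-protected?
        "returns_into_target": address <= ret < address + size,
    }


HYPOTHETICAL_VIRTUALPROTECTEX = 0x77001234  # placeholder, not a real export address

sample = struct.pack(
    "<7I",
    HYPOTHETICAL_VIRTUALPROTECTEX,
    0x00f30000,  # return address
    0xffffffff,  # hProcess = -1
    0x00f30000,  # lpAddress
    0x00001000,  # dwSize
    0x00000020,  # flNewProtect
    0x00f70054,  # lpflOldProtect
)
decoded = decode_virtualprotectex_frame(sample)
print(decoded["new_protect"], decoded["returns_into_target"])
```

A call that makes a page executable and then returns directly into that page is the pattern we care about.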
With this analysis complete, we have enough information to detect the attack manually.
Detecting the attack manually is useful, but for real-world IR, a semi-automated solution is much more practical. We turned to Volatility for this, writing a plugin to detect hidden code.
Fortunately, Volatility comes with a ‘timers’ plugin, which lists system timers in a similar fashion to WinDbg. One word of warning, though – there is a bug affecting 32-bit systems in the current version of Volatility. If you cannot detect your timer, I’d advise using our updated timers plugin until an official fix is available.
With our modifications, the ‘timers’ plugin is good to go. Our plugin uses it to obtain timer information, and then performs much the same steps as above – gathering APCs and observing their completion routine via the NormalRoutine member.
To automate things a little more, we need a way of assessing each timer’s completion routine. The natural way is to disassemble the first few bytes of the completion routine, and alert the operator if the usual x86 prolog is not present. This is a simple check to carry out, and so we do this, but we also attempt to get a higher quality classification by emulating the completion routine and observing its actions.
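As a rough illustration of the simple check, the sketch below compares the first bytes of a routine against a few common 32-bit prolog encodings. This is a simplification: the plugin disassembles the routine, whereas this only matches byte patterns, and the list is far from exhaustive. The pivot bytes at the end are an illustrative "pop reg; pop esp; ret" sequence, not bytes taken from the screenshots.

```python
# Byte patterns for common 32-bit function prologs (illustrative, not exhaustive).
COMMON_PROLOGS = (
    b"\x8b\xff\x55\x8b\xec",  # mov edi, edi; push ebp; mov ebp, esp
    b"\x55\x8b\xec",          # push ebp; mov ebp, esp
    b"\x55\x89\xe5",          # push ebp; mov ebp, esp (alternate encoding)
)


def has_standard_prolog(code):
    """Return True if `code` starts with one of the known prolog byte patterns."""
    return code.startswith(COMMON_PROLOGS)


normal_routine = b"\x8b\xff\x55\x8b\xec\x83\xec\x10"  # ordinary-looking function start
stack_pivot = b"\x59\x5c\xc3"                         # pop ecx; pop esp; ret

for name, code in (("normal", normal_routine), ("pivot", stack_pivot)):
    verdict = "ok" if has_standard_prolog(code) else "SUSPICIOUS"
    print(name, verdict)
```

A missing prolog is only a hint, since hand-written or optimised routines may legitimately start differently, which is why the emulation step is worth the extra effort.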
To emulate completion routines, we use the Unicorn engine. This is a CPU emulator which allows us to execute each instruction individually, examining the system state as we progress. If we see certain suspicious behaviours – such as a call to VirtualProtectEx – we can report a potential attack with higher confidence.
Unicorn is fairly easy to use for this task. First, we set up the emulated environment, allocating a stack and preparing for the APC handler to run:
    from unicorn import *
    from unicorn.x86_const import *

    unicornEng = Uc(UC_ARCH_X86, UC_MODE_32)

    # Populate the context from which to start emulating.
    # We use an arbitrary ESP, with a magic value to signify that the APC handler has returned.
    initialStackBase = 0xf0000000
    unicornEng.mem_map(initialStackBase, 2 * 1024 * 1024)
    unicornEng.mem_write(initialStackBase + 0x100 + 0, b"\xbe\xba\xde\xc0")

    # We push the argument which the APC handler is given.
    unicornEng.mem_write(initialStackBase + 0x100 + 4, apc.NormalContext.obj_vm.read(apc.NormalContext.obj_offset, 4))
    unicornEng.reg_write(UC_X86_REG_ESP, initialStackBase + 0x100)
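The magic bytes written at the top of the stack are simply the little-endian encoding of the sentinel return address 0xc0debabe. The short sketch below builds the same two-dword frame with `struct`; the NormalContext value is a placeholder chosen for illustration.

```python
import struct

SENTINEL_RETURN = 0xc0debabe              # fake return address marking "handler has returned"
HYPOTHETICAL_NORMAL_CONTEXT = 0x00f70000  # placeholder for the APC's NormalContext value

# 32-bit x86 is little-endian: the return address sits at [ESP], the argument at [ESP+4].
frame = struct.pack("<II", SENTINEL_RETURN, HYPOTHETICAL_NORMAL_CONTEXT)

print(frame[:4] == b"\xbe\xba\xde\xc0")  # True
```

When the handler executes its final `ret`, EIP becomes 0xc0debabe, which is how the emulation loop can tell that the routine has finished.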
One thing to note is that, instead of copying the whole process address space to the emulated address space, we load it on-demand via a callback:
    # Set up our handlers, which will map memory on-demand from the debuggee
    unicornEng.hook_add(UC_HOOK_MEM_READ_UNMAPPED, self.badmem)
    unicornEng.hook_add(UC_HOOK_MEM_WRITE_INVALID, self.badmem)
    unicornEng.hook_add(UC_HOOK_MEM_FETCH_UNMAPPED, self.badmem)

    def badmem(self, uc, access, address, size, value, user_data):
        # Unicorn will only successfully map page-aligned addresses, so map the whole page.
        pageSize = 0x1000
        pageBase = address & (~(pageSize - 1))
        uc.mem_map(pageBase, pageSize)
        # Read from the debuggee..
        pageCts = self.pas.read(pageBase, pageSize)
        if pageCts is None:
            self.dbgMsg("Unable to read %s bytes at %s" % (hex(pageSize), hex(pageBase)))
            raise MemoryError
        # And write to Unicorn.
        uc.mem_write(pageBase, pageCts)
        # Tell Unicorn the fault was handled, so that it retries the access.
        return True
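The page-rounding arithmetic in the handler is worth a closer look. The sketch below shows the same mask in isolation, plus a helper (our own addition, not part of the plugin) which lists every page an access touches, since an access near the end of a page can spill into the next one.

```python
PAGE_SIZE = 0x1000  # 4KB pages, as on x86


def page_base(address, page_size=PAGE_SIZE):
    """Round `address` down to the start of its page by clearing the low bits."""
    return address & ~(page_size - 1)


def pages_touched(address, size, page_size=PAGE_SIZE):
    """Return the base of every page covered by an access of `size` bytes."""
    first = page_base(address, page_size)
    last = page_base(address + size - 1, page_size)
    return list(range(first, last + page_size, page_size))


print(hex(page_base(0x00f30123)))                     # 0xf30000
print([hex(p) for p in pages_touched(0x00f30ffe, 4)])  # ['0xf30000', '0xf31000']
```

The mask only works because the page size is a power of two, so `page_size - 1` has exactly the low offset bits set.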
Our main loop performs the emulation and looks for anomalous situations:
    while instrEmulated < 10000:
        unicornEng.emu_start(nextIns, nextIns + 0x10, count = 1)
        # If we're now at our magic address, then our APC has completed executing entirely. That's all, folks.
        if nextIns == 0xc0debabe:
            break
        if nextIns == VirtualProtectEx:
            < ... omitted ... >
Once the emulated code calls VirtualProtect/Ex, we look at the arguments on the stack and note the address of memory which is being modified. If we later see a branch to this memory – that’s a definite sign of hidden activity!
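The bookkeeping behind this check can be sketched as follows. This is a simplified stand-in for what the plugin does, with class and method names invented for illustration: we record each region that a VirtualProtect/Ex call makes executable, and then test every later branch target against those regions.

```python
EXECUTE_MASK = 0xF0  # covers PAGE_EXECUTE, _READ, _READWRITE and _WRITECOPY


class ProtectionTracker:
    """Remembers regions made executable during emulation (illustrative helper)."""

    def __init__(self):
        self.executable_regions = []  # list of (start, end) pairs, end exclusive

    def on_virtual_protect(self, address, size, new_protect):
        """Record the region if the new protection allows execution."""
        if new_protect & EXECUTE_MASK:
            self.executable_regions.append((address, address + size))

    def is_hidden_code(self, branch_target):
        """Return True if a branch lands in memory that was just made executable."""
        return any(start <= branch_target < end
                   for start, end in self.executable_regions)


tracker = ProtectionTracker()
tracker.on_virtual_protect(0x00f30000, 0x1000, 0x20)  # PAGE_EXECUTE_READ
tracker.on_virtual_protect(0x00f80000, 0x1000, 0x04)  # PAGE_READWRITE, ignored

print(tracker.is_hidden_code(0x00f30000))  # True
print(tracker.is_hidden_code(0x00f80000))  # False
```

Protection changes that do not grant execute access are ignored, so ordinary calls that merely toggle write access should not raise an alert.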
Finally, our plugin is ready for use! We build the Gargoyle PoC, and run it. Then, we pause the VM and use the resulting system state as input for Volatility:
Note that the plugin displays the function prolog, to assist the operator, and also parses the probable location of the Gargoyle payload in memory – in this case, 0x01260000 in the process ‘Gargoyle.exe’.
While this approach will catch most Gargoyle-style attacks, there are certainly ways to defeat it. Most notably, our approach only detects code which is executed by a system timer, as in the original Gargoyle proof-of-concept code. While a system timer is the most elegant way of waiting for a period of time, there are other methods an attacker may be able to use to the same effect, such as asynchronous file or pipe IO.
Also, our emulation-based method of detecting ROP isn’t 100% reliable. Since the Unicorn engine can’t emulate every part of a full system, emulation may fail on some unusual timer handlers. Also, the Unicorn emulator currently lacks support for memory segmentation. This has the practical effect that code sequences containing segment overrides (such as the ubiquitous “mov eax, fs:[0]”) will reference incorrect memory ranges and cause a failure to emulate. This results in the Volatility plugin classifying the timer as “Unknown”:
An operator can then perform further analysis, using the displayed function prolog as a starting point.
Gargoyle is a good example of a technique best detected from kernel space. We started out faced with a quiet implant which was difficult to detect, walked through manual analysis to determine how best to detect it, and finished with an automated tool able to spot real-world attacks.