We previously released an analysis of the kernel mode payload of DOUBLEPULSAR that detailed how the kernel mode shellcode loaded a DLL into a target process using an Asynchronous Procedure Call (APC). However, the kernel payload did not actually perform the DLL load itself; rather, it set up the APC with some user mode shellcode that would perform the load.
This attracted our interest as the payload worked with any arbitrary DLL, but did not make use of the standard LoadLibrary call. Avoiding LoadLibrary can make the load more stealthy: it avoids the need to write the DLL to disk, can evade anything monitoring LoadLibrary calls, and can also avoid having an entry in the Process Environment Block (PEB), which is usually how a list of loaded modules is obtained. Such techniques are now fairly commonplace, but until now we were not aware of any public code that could load an arbitrary DLL in this way; existing code requires the DLL to be custom built to support being loaded. DOUBLEPULSAR is different in that it implements a more complete loader that can load almost any DLL. In addition, it works on almost any version of Windows as-is.
This article details the technique used by this user mode part of DOUBLEPULSAR and provides a test utility (available here) that can make use of the shellcode in a standalone form, so it can be easily seen in action and detection mechanisms can be tested against it. The utility does not use any kernel mode code, and simply makes use of the user mode loader to inject an arbitrary DLL into a target process. There is also a 32-bit version of this shellcode, but we have only analyzed the 64-bit version at this time.
A high level breakdown of the steps taken by the shellcode is given below. A more detailed analysis follows.
- A call-pop is used to self-locate so the shellcode can use static offsets from this address.
- Required Windows API functions are located by matching hashed module names, and then looping through each module's exported functions to match hashed function names.
- The DLL headers are parsed for key metadata.
- Memory is allocated of the correct size, at the preferred base address if possible. Any offset from the preferred base address is saved for later use.
- Each section from the DLL is copied into the appropriate offset in memory.
- Imports are processed, with dependent libraries loaded (using LoadLibrary) and the Import Address Table (IAT) filled in.
- Relocations are processed and fixed up according to the offset from the preferred base address.
- Exception (SEH) handling is set up with RtlAddFunctionTable.
- Each section’s memory protections are updated to appropriate values based on the DLL headers.
- The DLL’s entry point is called with DLL_PROCESS_ATTACH.
- The requested ordinal is resolved and called.
- After the requested function returns, the DLL entry point is called with DLL_PROCESS_DETACH.
- Exception handling is removed with RtlDeleteFunctionTable.
- The entire DLL in memory is set to writable, and zeroed out.
- The DLL’s memory is freed.
- The shellcode then zeros out itself, except for the very end of the function, which allows the APC call to return gracefully.
Towards the start of the shellcode we see a call-pop combination, where the next instruction is called and the return address immediately popped into a register. This allows the code to find its own address, and to use a static offset from this to find data in its own buffer.
We then enter a loop that takes values starting at offset 0xF0C in the shellcode and passes these to a function through rdx and rcx. This function, which we will call find_func, locates Windows API functions that the rest of the shellcode will need. The values it receives are a hash of the module name (in rdx) and a hash of the function name (in rcx). As these are just names, they can be hardcoded and will not change in different Windows builds.
It locates loaded modules through the PEB's loader data field (PebLdr), which is reached via the Thread Environment Block (TEB), and loops through each module, hashing its name to look for a match. Note that the module name hash is passed in through rdx and is pushed to the stack as part of a normal function prologue, and the shellcode accesses it on the stack from then on.
When a match is found, it moves on to a function matching loop, which operates in much the same way, but using exported function names from the exports table of the module. This is achieved by parsing the headers of the module in memory to find the IMAGE_DIRECTORY_ENTRY_ARRAY, which contains Relative Virtual Addresses (RVAs) to various image directory entries, including IMAGE_DIRECTORY_ENTRY_EXPORT, which holds export information. An RVA can be converted to the real address of the item it refers to by adding it to the base address of the module in memory. This parsing of PE headers is used extensively throughout the shellcode.
The export directory located through IMAGE_DIRECTORY_ENTRY_EXPORT contains RVAs to various arrays. AddressOfFunctions is an array of RVAs to the exported functions themselves. AddressOfNames is an array of RVAs to the ASCII names of the functions exported by name. AddressOfNameOrdinals runs parallel to AddressOfNames and holds, for each name, the index of the matching entry in AddressOfFunctions. The names can be iterated over and hashed to look for a match, and when one is found the address of the function can be resolved and saved.
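The lookup described above can be sketched in a few lines of Python. This is only an illustration on synthetic data, not the shellcode's own code: the hash function `name_hash` is a hypothetical stand-in (a rotate-right-13-and-add hash), as the actual hash used by DOUBLEPULSAR is not reproduced here, and the export names are given as plain strings rather than RVAs to strings.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by `count` bits."""
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def name_hash(name):
    """Hypothetical stand-in hash; NOT the hash DOUBLEPULSAR uses."""
    h = 0
    for byte in name.encode("ascii"):
        h = (ror32(h, 13) + byte) & 0xFFFFFFFF
    return h


def find_export(image_base, functions, names, name_ordinals, wanted_hash):
    """Return the address of the export whose hashed name matches, or None.

    functions     -- AddressOfFunctions: RVAs of the exported functions
    names         -- AddressOfNames: export names (already read as strings)
    name_ordinals -- AddressOfNameOrdinals: for each name, an index into functions
    """
    for i, name in enumerate(names):
        if name_hash(name) == wanted_hash:
            rva = functions[name_ordinals[i]]
            return image_base + rva  # RVA + module base = real address
    return None


# Synthetic export table: the name order differs from the function order.
functions = [0x3000, 0x1000, 0x2000]
names = ["Alpha", "Beta", "Gamma"]
name_ordinals = [1, 2, 0]
base = 0x180000000

addr = find_export(base, functions, names, name_ordinals, name_hash("Beta"))
print(hex(addr))
```

Because only the hashes need to be embedded in the shellcode, no readable API name strings appear in its buffer.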
The function pointers are saved in a local structure on the stack. This structure is used to store various things as the shellcode executes including these function pointers. It is generally accessed via rsi. The format of the structure is shown below, which also shows which functions are resolved by this loop (other values are initialized later on).
The function pointers are written to this struct after each call to find_func here:
You will also see another structure referenced usually through rbp, which exists in a blank chunk of memory in the shellcode buffer, initially just containing zeros, which is used as a scratch space. The structure shown below shows the format of what is stored where in this memory, which starts at offset 0x368 into the shellcode.
After locating the functions which the shellcode needs, the DLL is copied into a newly allocated bit of memory, and zeroed out from the shellcode buffer itself. This will not be the location where the DLL is properly loaded into, but just a temporary location holding the raw data of the DLL. This makes use of one of the previously located functions, kernel32!VirtualAllocStub (which is a simple wrapper around VirtualAlloc).
The size of the DLL comes from a value written at the end of the shellcode, in between the shellcode and the DLL in the shellcode buffer. This is one of just 2 values which are customised in the shellcode, the other being the ordinal that should be called on the DLL once loaded. The layout of the shellcode buffer looks like this:
These values are referenced relative to the address obtained by the self-locating call-pop instructions at the start, which placed the address of offset 0x25 in the shellcode buffer into rbp. Thus we see the size for the DLL memory allocation coming from [rbp+0xF5D], which is 0x25+0xF5D, or offset 0xF82 into the shellcode buffer.
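The offset arithmetic can be checked with a short Python sketch. It assumes the two customised values are stored as consecutive little-endian 32-bit integers, and it runs against a synthetic, zero-filled buffer rather than the real shellcode; the function name `read_trailer` is made up for this example.

```python
import struct

CALL_POP_OFFSET = 0x25                      # buffer offset whose address ends up in rbp
SIZE_DISPLACEMENT = 0xF5D                   # displacement seen in [rbp+0xF5D]
SIZE_OFFSET = CALL_POP_OFFSET + SIZE_DISPLACEMENT   # 0xF82
ORDINAL_OFFSET = SIZE_OFFSET + 4                    # 0xF86


def read_trailer(buf):
    """Read (dll_size, ordinal) from a shellcode buffer.

    Assumption: both values are little-endian unsigned 32-bit integers.
    """
    return struct.unpack_from("<II", buf, SIZE_OFFSET)


# Synthetic buffer standing in for the shellcode; only the two values are set.
buf = bytearray(ORDINAL_OFFSET + 4)
struct.pack_into("<II", buf, SIZE_OFFSET, 0x1600, 1)

dll_size, ordinal = read_trailer(bytes(buf))
print(hex(SIZE_OFFSET), hex(ORDINAL_OFFSET), hex(dll_size), ordinal)
```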
The DLL’s headers are then parsed to verify that the DLL is the correct architecture (32-bit vs 64-bit). If it is the wrong architecture the shellcode stops further processing to avoid any errors. Some useful values from the header are also saved for later use, such as a pointer to the SectionHeaders and the IMAGE_DATA_DIRECTORY_ARRAY which are used later on in the loading process.
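One common way to perform such an architecture check is to read the Machine field of the PE file header, sketched below in Python. We have not shown which header field the shellcode actually inspects, so treat the choice of field as an assumption; the helper names `is_x64_pe` and `make_stub` are invented for this example, and the input is a synthetic header stub rather than a real DLL.

```python
import struct

IMAGE_FILE_MACHINE_I386 = 0x014C
IMAGE_FILE_MACHINE_AMD64 = 0x8664


def is_x64_pe(raw):
    """Return True if `raw` starts with a PE image whose Machine field is AMD64."""
    if len(raw) < 0x40 or raw[:2] != b"MZ":
        return False
    # e_lfanew at offset 0x3C of the DOS header points to the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", raw, 0x3C)
    if len(raw) < e_lfanew + 6 or raw[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        return False
    # The Machine field is the first 16-bit value after the signature.
    (machine,) = struct.unpack_from("<H", raw, e_lfanew + 4)
    return machine == IMAGE_FILE_MACHINE_AMD64


def make_stub(machine):
    """Build a minimal synthetic header stub for demonstration only."""
    raw = bytearray(0x100)
    raw[:2] = b"MZ"
    struct.pack_into("<I", raw, 0x3C, 0x80)
    raw[0x80:0x84] = b"PE\0\0"
    struct.pack_into("<H", raw, 0x84, machine)
    return bytes(raw)


print(is_x64_pe(make_stub(IMAGE_FILE_MACHINE_AMD64)))
print(is_x64_pe(make_stub(IMAGE_FILE_MACHINE_I386)))
```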
Space is then allocated for the DLL to be properly loaded into. The size for this area does not come from the size of the DLL as it would take on disk, but from the SizeOfImage value in the DLL’s headers which refers to the size it will take when properly loaded into memory. The preferred base address from the headers is requested in the VirtualAlloc call, but if this is not available then the space is allocated somewhere else.
The offset from the preferred base address is stored for later use in relocation.
The DLL’s headers are then copied into the new memory area, by copying from the start of the DLL with a size taken from the SizeOfHeaders header field. After this, each section is identified and copied into the correct location in memory in a loop. This uses header fields like NumberOfSections to iterate over all the section headers, and from each section header PointerToRawData, SizeOfRawData and VirtualAddress are obtained, which are used to locate the raw data of the section in the DLL buffer and copy it into the correct location for the loaded DLL.
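The header and section copy can be illustrated with a small Python sketch. It models the section headers as plain dictionaries holding the three fields named above and works on a synthetic raw image, so it shows the layout transformation only; the function name `map_image` is made up for this example.

```python
def map_image(raw, size_of_image, size_of_headers, sections):
    """Lay out a raw (on-disk format) image as it would sit in memory.

    sections -- list of dicts with PointerToRawData, SizeOfRawData, VirtualAddress
    """
    image = bytearray(size_of_image)          # zero-filled, SizeOfImage bytes
    image[:size_of_headers] = raw[:size_of_headers]
    for sec in sections:
        src = sec["PointerToRawData"]         # where the section sits in the file
        size = sec["SizeOfRawData"]
        dst = sec["VirtualAddress"]           # RVA where it belongs in memory
        image[dst:dst + size] = raw[src:src + size]
    return image


# Synthetic file: 0x200 bytes of headers followed by two 0x200-byte sections.
raw = b"H" * 0x200 + b"T" * 0x200 + b"D" * 0x200
sections = [
    {"PointerToRawData": 0x200, "SizeOfRawData": 0x200, "VirtualAddress": 0x1000},
    {"PointerToRawData": 0x400, "SizeOfRawData": 0x200, "VirtualAddress": 0x2000},
]
image = map_image(raw, 0x3000, 0x200, sections)
print(len(image), image[0x1000:0x1004], image[0x2000:0x2004])
```

The gaps between sections stay zero-filled, which is why the in-memory size comes from SizeOfImage rather than from the size of the file.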
The imports are now loaded. At the start of this process, another region of memory is allocated, but this is only used by a function which appears to be unused. It is possible that it could be legacy functionality that is no longer required.
The Import Table is parsed and each library is loaded with LoadLibrary. This is of course less stealthy, but we assume the user can either avoid dependent libraries, or only use Windows libraries that are more likely to be ignored as legitimate libraries in a process. The import tables are located using the IMAGE_DIRECTORY_ENTRY_IMPORT entry in the IMAGE_DIRECTORY_ENTRY_ARRAY, a pointer to which was saved earlier when the DLL’s headers were initially parsed. The import table is then walked, resolving the offset to each library name string and calling LoadLibrary on it.
Each function imported from the library is then identified using the FirstThunk value from the import table, which has an offset to an IMAGE_IMPORT_BY_NAME structure, or an ordinal, depending on whether functions are imported by name or by ordinal. The value of FirstThunk is checked against the bitmask 0x8000000000000000, which is the IMAGE_ORDINAL_FLAG64 mask; if set, this value contains an ordinal in the lower bits, and if not set then it is an import by name and the offset to the function name string can be located. GetProcAddress is then called to resolve the function address.
During this process the scratch space referenced by rbp is used to save certain things temporarily, such as the latest FirstThunk value and the offset to IMAGE_IMPORT_BY_NAME.
The address is then written back to FirstThunk in the import table as a bound import.
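The ordinal-versus-name test on each thunk value can be sketched as follows. This is an illustrative Python fragment, not the shellcode's logic verbatim: the function name `classify_thunk` is made up, and the masks for the ordinal (low 16 bits) and the name RVA (low 31 bits) follow the PE format rather than anything specific to DOUBLEPULSAR.

```python
IMAGE_ORDINAL_FLAG64 = 0x8000000000000000


def classify_thunk(thunk):
    """Decide how a 64-bit import thunk value should be resolved.

    Returns ("ordinal", n) when the top bit is set, otherwise
    ("name_rva", rva) where rva points to an IMAGE_IMPORT_BY_NAME structure.
    """
    if thunk & IMAGE_ORDINAL_FLAG64:
        return ("ordinal", thunk & 0xFFFF)       # ordinal lives in the low 16 bits
    return ("name_rva", thunk & 0x7FFFFFFF)      # RVA lives in the low 31 bits


by_ordinal = classify_thunk(0x8000000000000010)
by_name = classify_thunk(0x2A40)
print(by_ordinal, by_name)
```

In either case the result is what gets handed to GetProcAddress: the ordinal itself, or the name string found at the RVA.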
We also see a call to the function that makes use of the mystery memory that was allocated earlier. However, this call can never be reached as before it is a call to a function that just sets rax to 1, and a jump over the call if rax is not 0. This is very unusual, and as it appears to be an unused function it was not analyzed at this time.
After processing imports, the image base in the headers of the loaded DLL is updated to reflect the real base address we ended up loading at.
Relocations are now dealt with, to account for any offset from the preferred base address of the DLL. IMAGE_DIRECTORY_ENTRY_BASERELOC is found from the headers, and used to iterate through all the relocations and fix them up if necessary. Only relocation types 3, 10 and 0 are handled, which in a base relocation table are IMAGE_REL_BASED_HIGHLOW, IMAGE_REL_BASED_DIR64 and IMAGE_REL_BASED_ABSOLUTE respectively (and ABSOLUTE is ignored by design anyway, as it is only padding).
Each relocation block is iterated through, taking each 16-bit entry and checking its top 4 bits to determine the relocation type. Depending on the type, the low 12 bits are used as the offset of the relocation within the block's page, and the value at that location is found and updated based on the offset from the preferred base address.
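A minimal Python sketch of fixing up one relocation block is shown below. It uses the type values 0, 3 and 10, named here with the PE format's IMAGE_REL_BASED_* constants, and operates on a synthetic image; the function name `apply_block` and the example addresses are invented for illustration.

```python
import struct

IMAGE_REL_BASED_ABSOLUTE = 0    # padding entry, nothing to do
IMAGE_REL_BASED_HIGHLOW = 3     # 32-bit address fix-up
IMAGE_REL_BASED_DIR64 = 10      # 64-bit address fix-up


def apply_block(image, page_rva, entries, delta):
    """Apply one base relocation block to a mapped image (a bytearray).

    entries -- 16-bit values: top 4 bits = type, low 12 bits = offset in page
    delta   -- actual load address minus preferred base address
    """
    for entry in entries:
        rel_type = entry >> 12
        target = page_rva + (entry & 0x0FFF)
        if rel_type == IMAGE_REL_BASED_DIR64:
            (value,) = struct.unpack_from("<Q", image, target)
            struct.pack_into("<Q", image, target, (value + delta) & 0xFFFFFFFFFFFFFFFF)
        elif rel_type == IMAGE_REL_BASED_HIGHLOW:
            (value,) = struct.unpack_from("<I", image, target)
            struct.pack_into("<I", image, target, (value + delta) & 0xFFFFFFFF)
        # IMAGE_REL_BASED_ABSOLUTE and unknown types are skipped


preferred_base = 0x180000000
actual_base = 0x190000000
image = bytearray(0x2000)
# A pointer stored at RVA 0x1008 that assumes the preferred base address.
struct.pack_into("<Q", image, 0x1008, preferred_base + 0x1234)

entries = [(IMAGE_REL_BASED_DIR64 << 12) | 0x008, 0x0000]
apply_block(image, 0x1000, entries, actual_base - preferred_base)

(fixed,) = struct.unpack_from("<Q", image, 0x1008)
print(hex(fixed))
```

If the DLL is loaded at its preferred base address, the delta is zero and the fix-ups leave every value unchanged.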
Exceptions are set up using IMAGE_DIRECTORY_ENTRY_EXCEPTION and a call to RtlAddFunctionTable.
Each section then has its memory protections updated to the appropriate values from the SectionHeaders with a call to VirtualProtect.
The DLL entry point is then located using AddressOfEntryPoint in the headers, and is called with DLL_PROCESS_ATTACH (which has the value 1) to let the DLL know it is loaded. This completes the loading process.
With the DLL now loaded, the ordinal requested to be called is resolved to a function, and the function is called. The ordinal value is found using the address from the self-locating call-pop instructions, just like the DLL size before, but is found at offset 0xF86 in the shellcode buffer.
The ordinal Base value is obtained from the DLL headers; this is the start value for ordinals in the library. Subtracting this from the requested ordinal provides the index into the exported functions array AddressOfFunctions for the required function. This provides the RVA of the function, which can be added to the image base address to get the real address of the function.
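The ordinal resolution amounts to a subtraction and an array lookup, as this Python sketch on synthetic values shows. The function name `resolve_ordinal` is made up, and the bounds check is an addition for the example; we make no claim here about whether the shellcode validates the index.

```python
def resolve_ordinal(image_base, ordinal_base, functions, ordinal):
    """Turn an export ordinal into a real address, or None if out of range.

    ordinal_base -- the Base field of the export directory
    functions    -- AddressOfFunctions: RVAs of the exported functions
    """
    index = ordinal - ordinal_base
    if not 0 <= index < len(functions):
        return None
    return image_base + functions[index]


functions = [0x1000, 0x1040]
entry = resolve_ordinal(0x180000000, 1, functions, 2)
print(hex(entry))
```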
The function gets some stack space and arguments in registers rcx, rdx and r8. Of course if the function accepts no arguments, the presence of these arguments and stack space will not make any difference to the function, but any custom written DLL may wish to take advantage of these. The return value is saved on the stack by the shellcode, although it is never actually used for anything.
After the function returns, cleanup begins. The DLL is unloaded and (most) things are zeroed out in memory. First the DLL entry point is called with DLL_PROCESS_DETACH (which has a value of 0).
Exceptions are cleaned up with RtlDeleteFunctionTable.
The loaded DLL’s memory is made writable so it can be zeroed out, and then the memory is freed. In the course of this, the VirtualProtectStub call requires a pointer to a writable location for the lpflOldProtect parameter. Although we do not care about its value, a pointer must be provided, so a location in the scratch space referenced through rbp is used.
The shellcode then zeroes itself out, all except for a small function epilogue which must remain to allow the function to return properly. A side effect of this is a small memory artefact after exploitation.
After a DLL has been executed by DOUBLEPULSAR there are still a couple of memory artefacts left behind that can be detected. Firstly, the memory that is allocated at the beginning which the DLL is copied into from the shellcode buffer is never zeroed out or freed. This is a little bit unusual as efforts are taken elsewhere to remove as much as possible from memory, yet this memory region is left there with PAGE_EXECUTE_READWRITE permissions and a full copy of the DLL.
In fact, this memory area does not even need execute permissions as it is just used for read and write. This is especially unusual because the combination of read, write and execute is quite suspicious, as it is rarely needed by legitimate processes and is usually only seen in the course of exploitation. The fact that it also starts with an unmodified MZ header is even more suspicious. Furthermore, it does not appear that this memory area is really needed at all, as the DLL could just be loaded directly from the shellcode buffer, with the self-locating call-pop instructions being enough to locate the DLL.
It may be that this memory region is a legacy thing from an older, less sophisticated reflective load, and was never properly refactored when the newer technique was coded up. Regardless, you will have in memory a suspicious memory region containing a copy of the DLL which was executed on the host. You may also see several of these memory regions if the payload was executed more than once.
The other memory artefact is one that is more difficult to avoid. This is the function epilogue of the main function of the shellcode which is executed by the APC call. This is needed to ensure the function can return cleanly and avoid crashing the process, and although there are ways to make it smaller it would be challenging to avoid any minor trace.
What you will see then is a memory region with PAGE_EXECUTE_READWRITE permissions containing mostly zeros. At offset 0xF70 you will see:
f3 aa 58 41 5f 41 5e 41 5d 41 5c 5e 5f 5d 5b c3 eb 08
After this will be two 32-bit integers, the first of which is the size of the DLL injected, the second is the ordinal that was executed in the DLL. These two values will of course be different each time, and are not therefore suitable for a static signature. They could however be used to work out which DLL was executed (as it will still be mapped in the other RWX memory region with that size) and what function was executed in that DLL.
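As a rough illustration of how this artefact could be checked for, the Python sketch below tests a captured memory region for the epilogue bytes at offset 0xF70 and, if present, reads the two values that follow. It assumes the region has already been dumped to a bytes object by some other tool and that the two integers are little-endian; the function name `check_region` is made up, and the sketch does not verify that the rest of the region is zeroed.

```python
import struct

EPILOGUE = bytes.fromhex("f3aa58415f415e415d415c5e5f5d5bc3eb08")
EPILOGUE_OFFSET = 0xF70


def check_region(region):
    """Return (dll_size, ordinal) if the epilogue artefact is present, else None."""
    end = EPILOGUE_OFFSET + len(EPILOGUE)        # 0xF82, where the two values start
    if len(region) < end + 8:
        return None
    if region[EPILOGUE_OFFSET:end] != EPILOGUE:
        return None
    # Assumption: two little-endian unsigned 32-bit integers follow the epilogue.
    return struct.unpack_from("<II", region, end)


# Synthetic region standing in for a dumped RWX page.
region = bytearray(0x1000)
region[EPILOGUE_OFFSET:EPILOGUE_OFFSET + len(EPILOGUE)] = EPILOGUE
struct.pack_into("<II", region, EPILOGUE_OFFSET + len(EPILOGUE), 0x16000, 1)

hit = check_region(bytes(region))
print(hit)
```

The recovered size can then be compared against the other RWX regions in the process to find the one holding the raw DLL.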
Using this information you will be able to obtain artefacts from memory that can reconstruct a large part of the attack that was carried out using the DOUBLEPULSAR backdoor.
Test the Payload with DOUBLEPULSAR-usermode-injector
We have released a small utility that can be used to invoke the usermode DLL loading mechanism of the DOUBLEPULSAR payload in order to test detection mechanisms and perform further research. This will use the shellcode to inject a DLL into a process of your choice, entirely from usermode.
The utility is available here.
Interestingly, the shellcode is generic enough that it can be triggered in various ways. As an example, the utility will queue a usermode APC in a similar way to the kernel payload, but can also trigger the shellcode using CreateRemoteThread in a similar way to more common DLL injection techniques (but still avoiding LoadLibrary).
The screenshot below shows the tool in use, injecting a DLL that pops up a message box into a calc.exe process.
After execution, two memory regions are seen that have PAGE_EXECUTE_READWRITE permissions and are not associated with an ordinarily loaded DLL image.
One of these regions we can see is mostly filled with zeros, but with the small epilogue artefact at 0xF70. The other contains the original raw DLL.