My Emulation Goes to the Moon... Until False Flag
Introduction
Modern cybersecurity defenses rely on realistic adversary emulation to improve an organization's security posture.
However, achieving true realism in malware emulation presents a significant challenge that goes well beyond simple functionality replication.
It's not just about triggering alerts or mimicking network protocols. Effective adversary emulation requires reproducing the subtle quirks, behaviors, and—most critically—the obfuscation techniques that make real-world malware so challenging to analyze.
This level of accuracy is what distinguishes a genuine emulation from a mere simulation, and it's precisely what defenders need to test their capabilities against sophisticated adversaries.
Our research journey began with a comprehensive study of the obfuscation techniques employed by various APT groups. We then replicated many of these approaches to better understand the underlying mechanics. Through this iterative process, we developed the experience and tooling necessary to tackle increasingly sophisticated protection schemes.
Recently, Mandiant published a very compelling analysis of APT41's Scatterbrain obfuscator. The researchers didn't just provide a high-level description of the obfuscation scheme: they published a complete analysis of its core mechanisms and, crucially, open-sourced a fully functional deobfuscator.
Scatterbrain
Scatterbrain is a sophisticated obfuscator used by APT41, a prolific Chinese threat actor whose techniques have been extensively studied by security researchers. The obfuscator represents a significant evolution in code protection techniques, implementing unprecedented methods that fundamentally alter program execution flow.
Despite its critical role in modern malware, obfuscation is often overlooked in adversary emulation efforts—primarily due to the difficulty of accessing and understanding the original obfuscation methods. However, studying and replicating these techniques provides invaluable insights for both offensive and defensive security operations.
Our goal was ambitious: create a pixel-perfect replica of Scatterbrain's obfuscation that could pass validation against existing analysis tools.
Scatterbrain Techniques
The Scatterbrain compiler obfuscator implements several obfuscation techniques, but the most distinctive one is the instruction dispatcher. This technique is the core component of the whole protection, as it is rooted in the program's execution flow.
Instruction Dispatcher
Rather than allowing code to execute in predictable sequential patterns, it segments executable code into disjointed basic blocks, each ending with a uniquely crafted dispatcher routine. Here's how this sophisticated mechanism operates:
- Each basic block terminates with a call to a block-specific dispatcher function
- Immediately following each call lies a 32-bit encoded displacement value
Disassemblers attempting to decode these displacement bytes as instructions fail, producing garbled output instead.
Dispatcher Mechanism
Each dispatcher routine follows a carefully crafted sequence:
push rsi ; Preserve working register
mov rsi, [rsp+8] ; Retrieve return address from stack
movsxd rsi, [rsi] ; Read and sign-extend 32-bit displacement
pushfq ; Save processor flags
xor rsi, 0xdeadbeef ; Decode using obfuscator-specific key
sub [rsp+16], rsi ; Update return address with decoded result
popfq ; Restore processor flags
pop rsi ; Restore working register
retn ; Jump to calculated next instruction
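The sequence above can be checked round-trip with a minimal Python model; the key, the addresses, and the `encode` helper (the build-time inverse) are invented for illustration:

```python
# Python model of one Scatterbrain dispatcher invocation (values invented).
MASK64 = (1 << 64) - 1

def sext32(v):
    """Sign-extend a 32-bit value, as movsxd and imm32 operands do."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v

def dispatch(ret_addr, encoded_dword, key=0xDEADBEEF):
    """Emulate the dispatcher: movsxd, xor with the key, sub from the
    saved return address, then 'return' to the computed location."""
    rsi = sext32(encoded_dword)       # movsxd rsi, [rsi]
    rsi ^= sext32(key)                # xor rsi, key (imm32 is sign-extended)
    return (ret_addr - rsi) & MASK64  # sub [rsp+16], rsi ; retn

def encode(ret_addr, target, key=0xDEADBEEF):
    """Hypothetical build-time inverse: the dword stored after the call."""
    return ((ret_addr - target) ^ sext32(key)) & 0xFFFFFFFF

ret = 0x140002000
assert dispatch(ret, encode(ret, 0x140001F00)) == 0x140001F00  # block before
assert dispatch(ret, encode(ret, 0x140002200)) == 0x140002200  # block after
```

The sign extension matters: on x86-64, both `movsxd` and the 32-bit immediate of `xor` widen to 64 bits, which is what lets a 32-bit dword steer a 64-bit return address in either direction.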
The calculated offsets determine subsequent execution addresses through varying sequences of arithmetic and bitwise operations (`XOR`, `ADD`, `SUB`, `OR`), making static analysis tedious. Each dispatcher has different:
- Working registers (`RAX`, `RBX`, `RCX`, `RDX`, `RSI`, `RDI`, `R8`-`R15`)
- Decoding operations and constants
To further disrupt analysis frameworks, each dispatcher instruction is followed by a pair of conditional jumps (`jcc`) with opposite conditions targeting the same destination:
jz next_instruction ; Jump if zero
mov reg, reg ; Dummy instruction
jnz next_instruction ; Jump if not zero
next_instruction:
This effectively creates an unconditional jump because one branch is always taken and the destination address is the same.
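A trivial Python truth table confirms the reasoning: whatever the state of the zero flag, control always reaches the shared destination.

```python
# Truth-table sketch: a jz/jnz pair sharing one target is always taken.
def opposing_pair_reaches_target(zero_flag):
    if zero_flag:         # jz next_instruction
        return True
    # mov reg, reg        ; dummy instruction, does not touch flags
    if not zero_flag:     # jnz next_instruction
        return True
    return False          # fall-through: never reached

assert all(opposing_pair_reaches_target(zf) for zf in (False, True))
```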
Import Protection
The second layer transforms binary dependencies through dedicated stub dispatchers that obscure API usage. Rather than direct imports, Scatterbrain employs a simple import by hash with a twist: in addition to the resolver routine, which in this case simply makes use of LoadLibrary and GetProcAddress, a stub routine and a dedicated struct are used for each import.
The struct contains:
- The RVA of the encrypted DLL name exposing the function
- The RVA of the encrypted API name
- A pointer where to store the address once resolved
struct obf_imp_t {
uint32_t CryptDllNameRVA;
uint32_t CryptAPINameRVA;
uint64_t ResolvedImportAPI;
};
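As a sanity check, the struct's 16-byte on-disk footprint can be modeled and round-tripped in Python (the RVA values are invented):

```python
import struct

# Model of obf_imp_t's layout: two 32-bit RVAs plus a 64-bit pointer slot.
# "<IIQ" is little endian with no padding, matching the natural C layout.
OBF_IMP_FMT = "<IIQ"

def pack_obf_imp(dll_rva, api_rva, resolved=0):
    return struct.pack(OBF_IMP_FMT, dll_rva, api_rva, resolved)

blob = pack_obf_imp(0x5000, 0x5020)  # hypothetical RVAs
assert struct.calcsize(OBF_IMP_FMT) == 16
assert struct.unpack(OBF_IMP_FMT, blob) == (0x5000, 0x5020, 0)
```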
Every stub routine is tailored to a specific import and uses the respective import data structure, passed as an argument via a `lea`. This means that all uses of a given API reference the same import data structure.
DLL and API names are encrypted via XOR with a stream cipher generated by a Linear Congruential Generator, which operates according to the following formula:
$$X_{n+1} = (a \cdot X_n + c) \bmod 2^{32}$$
Where:
- `a` is always 17
- `c` is a 32-bit constant specific to the sample
- the modulus $2^{32}$ limits values to the 32-bit range
The decryptor routine uses the first 4 bytes of encrypted data to initialize the generator, then iteratively produces XOR keys for decryption.
The decryption terminates when a special condition is met: when the current encrypted byte equals the sum of the first 4 bytes of the previous LCG state.
Here's the resolver routine:
uint64_t ObfImportResolver(struct obf_imp_t *imp) {
    // Check if the API is already resolved
    FARPROC ResolvedApi = (FARPROC)imp->ResolvedImportAPI;
    if (!ResolvedApi) {
        char *DllName = ImpDecryptStr(imp->CryptDllNameRVA);
        if (DllName) {
            HMODULE dll = LoadLibraryA(DllName);
            if (dll) {
                char *ApiName = ImpDecryptStr(imp->CryptAPINameRVA);
                if (ApiName) {
                    ResolvedApi = GetProcAddress(dll, ApiName);
                    imp->ResolvedImportAPI = (uint64_t)ResolvedApi;
                }
            }
        }
    }
    return (uint64_t)ResolvedApi;
}
Replication
Replicating sophisticated, compiler-level obfuscation requires robust tools that allow extensive code manipulation. Although we do not know what toolchain APT41 used, we chose LLVM for our implementation due to its powerful intermediate representation (IR) and its maturity.
Our approach leverages custom compiler passes that transform code at the IR level, using an out-of-source methodology that doesn't require recompiling the entire LLVM infrastructure. Instead, we dynamically load our obfuscation pass against an existing LLVM installation.
Passes are the cornerstone of the compiler: each available optimization is implemented as one. They operate on the given source code in its SSA IR form and, over successive iterations (stripping out dead code, removing redundant lookups, and so on), progressively lower it toward the final binary.
Our obfuscation pipeline first converts the given source code to its IR form through `clang`, then loads the shared object (containing the obfuscation pass) and uses `opt` to apply the obfuscation, and finally (cross-)compiles the modified IR into a native x86 PE.
Engineering the instruction dispatcher
The dispatcher implementation presented several technical challenges that pushed the boundaries of what's possible with LLVM's IR manipulation.
Relocating the instructions
The first step is to give each IR instruction its own dedicated basic block: create the block, move the instruction into it, and branch the blocks together accordingly. Ideally, each dispatcher will then protect a single instruction, although we will explain later why this is not always the case.
To accomplish this, we leverage the `splitBasicBlock` method of LLVM's `BasicBlock` class to dichotomously split the existing blocks: each block is split in two at every instruction, separating it from the succeeding ones. We repeat this process until we reach the terminator.
Because of the way the method works, there is no need to worry about moving or cloning instructions, updating operands, or connecting blocks via branch instructions. LLVM handles all of that, which is precisely what we would have had to do if we had relocated them manually.
Creating the dispatcher
The next step involves creating the dispatcher functions and integrating them with the opposing-conditional-jumps.
We make extensive use of inline assembly to replicate the exact instruction sequences. LLVM's `InlineAsm` feature allows us to embed assembly code directly in the IR, maintaining precise control over the generated instructions.
The dispatcher functions first select a random working register from those available and a 32-bit encoding key, via two simple helper functions we wrote.
std::string Register = GetRandomRegister();
std::string Key = std::to_string(GenerateRandomKey());
These strings are used to randomize the dispatcher instructions. They serve as parameters for the `InlineAsm` LLVM object:
const std::vector<std::string> CompleteAsmString = {
"push " + Register + "\n\t",
"mov " + Register + ", [rsp + $$8]\n\t",
"movsxd " + Register + ", dword ptr [" + Register + "]\n\t",
"pushfq\n\t",
"xor " + Register + ", 0x" + Key + "\n\t",
"add [rsp + $$16], " + Register + "\n\t",
"popfq\n\t",
"pop " + Register + "\n\t",
"retn"
};
Using a vector instead of a single string allows us to output a pair of opposing conditional jumps after each instruction, both of which are emitted as `InlineAsm`.
In order to provide a valid destination to both `jcc` instructions, i.e., the next dispatcher instruction, we make use of `BlockAddress`, which acts as a placeholder for the address of a basic block during IR manipulation, when addresses are not yet known. In LLVM IR there is no concept of final addresses, because it is an intermediate representation designed to be target-independent: address allocation and layout are handled later by the backend during code generation, depending on the target architecture and ABI.
The procedure for creating such a `jcc` involves selecting a type of `jcc` (jump if zero, jump if greater, jump if below, etc.) and its negation, via a function that works similarly to the one that selects the working register.
The inline asm syntax allows the creation of a `jcc` that is parametric with respect to its target, which is passed as an argument to the `InlineAsm` call.
auto [Jcc, NegatedJcc] = GetRandomJCC();
std::string JmpString =
Jcc + " ${0:P}\n\t"
+ "mov %" + DummyRegister + ", %" + DummyRegister + "\n\t"
+ NegatedJcc + " ${0:P}\n\t";
The `${0:P}` syntax is LLVM's inline assembly parameter substitution mechanism, which allows us to pass the block address as an argument to the `InlineAsm` call.
Now we can construct each dispatcher by alternating between the actual dispatcher instructions and the obfuscating jump sequences.
// For each instruction in the dispatcher sequence
for (size_t i = 0; i < CompleteAsmString.size(); ++i) {
    // Create the current dispatcher instruction
    std::string AsmString = CompleteAsmString[i];
    InlineAsm *Asm = InlineAsm::get(FunctionType, AsmString);
    Builder.CreateCall(Asm, {});
    // Follow it with the opposing conditional jumps
    InlineAsm *Jump = InlineAsm::get(JumpType, JmpString);
    Builder.CreateCall(Jump, {NextBlockAddress, NextBlockAddress});
}
This code creates two separate `InlineAsm` calls for each step:
- The dispatcher instruction: `Builder.CreateCall(Asm, {})` generates the functional instruction (`push`, `mov`, `xor`, etc.)
- The jump pair: `Builder.CreateCall(Jump, {NextBlockAddress, NextBlockAddress})` generates the conditional jumps with the same target address passed twice, once for each jump instruction. LLVM's parameter substitution mechanism (`${0:P}`) ensures both jumps receive the same resolved address.
However, there's a crucial LLVM IR constraint we must address: every basic block must end with a terminator instruction (branch, invoke, return, or unreachable). Since our anti-analysis jump pairs are implemented as inline assembly, LLVM doesn't recognize them as proper terminators. To satisfy this requirement and prevent branch folding at the same time, we convert each direct branch instruction after each jump pair to an indirect one:
auto *IndBR = Builder.CreateIndirectBr(NextBlockAddress);
IndBR->addDestination(NextBB);
To spice things up, as seen in the available APT41 samples, the dispatcher's basic blocks are randomly shuffled within the function. In LLVM, this shuffling is possible without altering the program's semantics thanks to the IR's Static Single Assignment (SSA) form. In SSA form, each variable is assigned exactly once, and every use of a variable is dominated by its definition. This property ensures that the relative order of basic blocks doesn't affect the program's correctness as long as the control flow graph remains intact.
The basic block positions are reflected in the final binary, but the program's correctness remains unaltered thanks to the SSA form, which ensures that references are unambiguous.
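The invariance argument can be demonstrated with a toy interpreter in which blocks carry explicit successor labels, so storage order is irrelevant (block names and operations are invented):

```python
import random

# Toy CFG: execution follows explicit successor labels, not storage order,
# mirroring why shuffling LLVM basic blocks preserves program semantics.
blocks = {
    "entry": (lambda acc: acc + 1, "mid"),
    "mid":   (lambda acc: acc * 3, "exit"),
    "exit":  (lambda acc: acc - 2, None),
}

def run(cfg, start="entry"):
    acc, label = 0, start
    while label is not None:
        op, label = cfg[label][0], cfg[label][1]
        acc = op(acc)
    return acc

baseline = run(blocks)
order = list(blocks)
random.shuffle(order)                      # storage order changes...
shuffled = {k: blocks[k] for k in order}   # ...but the edges are untouched
assert run(shuffled) == baseline == 1      # (0 + 1) * 3 - 2
```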
To prevent the compiler from creating prologues/epilogues, applying optimizations, or forcing inlining (any of which could disrupt our work), we mark the newly created functions with three attributes: `naked`, `optnone` and `noinline`.
Here is the final result
Computing displacements
This is where we ran into the limitations of the LLVM-based approach, especially when it came to working with intermediate representation. There are two conditions that need to be met to create the displacements:
- computing the distance in terms of bytes between one (IR) instruction and the following
- emitting that distance in the form of an x86 instruction
The first requirement might actually be solvable: we could potentially use LLVM's `getPtrToInt` to convert `BlockAddress` values to integers and calculate differences between them. However, we didn't investigate this approach extensively because the second requirement proved to be the real showstopper.
In traditional assembly, you can emit raw bytes using directives like `.byte`, `.word`, `.long` and `.quad`.
However, these directives are static, meaning they require compile-time constants. You cannot pass runtime-computed values or parameters to them, as we did with the opposing conditional jumps, which was precisely what we needed.
Because there's no way to know the addresses at the IR level, our first approach was to use a placeholder to be overwritten once the addresses are known. The chosen placeholder also solves another problem that was already mentioned: branch folding. The compiler is smart enough to recognize that the previously split blocks have no real reason to be separated and proceeds to merge them, defeating our effort to assign each instruction its own dedicated basic block.
Again, the choice fell on changing each direct unconditional jump into an indirect unconditional jump using LLVM's `IndirectBrInst`:
Value *Successor = BlockAddress::get(BB);
IndirectBrInst *IBR = IRB.CreateIndirectBr(Successor);
Br->eraseFromParent();
LLVM typically translates indirect jumps as predictable instruction sequences:
lea rax, loc_nextBB ; 7 bytes
jmp rax ; 2 bytes
This gives us a reliable placeholder of exactly 9 bytes (7 for the `lea` instruction and 2 for the `jmp`), which we will locate and overwrite post-compilation. It provides enough space to write the 32-bit displacement while preventing the compiler from branch folding.
48 8D 0D D3 EC 04 00 lea rcx, string
E8 76 30 02 00 call puts
E8 39 11 00 00 call dispatch18
48 8D 05 B3 FE FF FF lea rax, loc_1400011F1
FF E0 jmp rax
Post-compilation phase
Once compiled into PE format, we use Python's `pefile` library and the `capstone` disassembler to:
1. Function Identification and Classification
By marking the dispatcher functions as `DLLExportClass` in advance, we can easily parse the `IMAGE_EXPORT_DIRECTORY` to identify the functions that require post-compilation modifications.
for exp in pe.DIRECTORY_ENTRY_EXPORT.symbols:
    if "dispatch" in exp.name.decode():
        dispatches.append(exp.address)
        continue
    functions[exp.name.decode()] = exp.address
Dispatcher functions are easily found by their naming convention and export status, while target functions, also marked as `DLLExportClass`, are collected separately for processing.
2. Dynamic Encryption Key Extraction
We disassemble each dispatcher function to extract the XOR encryption key used for displacement encoding:
def find_encryption_key(dispatch):
    dispatch_func = pe.get_data(dispatch)
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True
    for i in md.disasm(dispatch_func, 0):
        if i.mnemonic == "xor":
            return int(i.operands[1].imm)
This approach is crucial because each dispatcher uses a different randomly generated key.
3. Placeholder Detection and Displacement Calculation
The core processing loop scans each target function for our signature pattern: calls to dispatcher functions followed by the `lea` + `jmp` placeholder:
# After finding a call to a dispatcher
if i.mnemonic == "lea" and i.operands[0].reg == X86_REG_RAX:
    next_rip = rva + i.address + i.size
    jmp_dest = next_rip + i.operands[1].mem.disp
    after_call = rva + i.address

# Then look for the corresponding jmp
if i.mnemonic == "jmp" and i.operands[0].reg == X86_REG_RAX:
    diff = jmp_dest - after_call
    res = (diff ^ key) & 0xFFFFFFFF  # Encode with the dispatcher's key, masked to 32 bits
    to_write = res.to_bytes(4, byteorder='little') + random.randbytes(5)
    pe.set_bytes_at_rva(after_call, to_write)
The script calculates the actual displacement between the call site and the target basic block, encodes it with the dispatcher's XOR key, and overwrites the 9-byte placeholder with the 4-byte encoded displacement plus 5 random bytes.
An important consideration here is that because we randomly shuffle basic blocks within functions, the displacement can be negative. The target basic block might appear at lower addresses than the current block, resulting in a negative offset. The displacement is stored as a signed 32-bit integer in two's complement format, which allows the x86 processor to correctly interpret both positive and negative values.
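The patch arithmetic, including the negative case, can be checked in isolation with a small Python model; the key and addresses are invented, and masking to 32 bits is what produces the two's complement form:

```python
import random

KEY = 0x3A7F  # per-dispatcher keys are random; this value is invented

def patch_bytes(after_call, jmp_dest, key=KEY):
    """Build the 9-byte patch: 4-byte encoded displacement (two's
    complement, little endian) plus 5 random filler bytes."""
    diff = jmp_dest - after_call              # may be negative
    enc = (diff ^ key) & 0xFFFFFFFF           # mask to 32 bits
    return enc.to_bytes(4, "little") + random.randbytes(5)

def decode(patch, key=KEY):
    """What the dispatcher effectively computes from the patched bytes."""
    enc = int.from_bytes(patch[:4], "little")
    diff = (enc ^ key) & 0xFFFFFFFF
    return diff - (1 << 32) if diff & 0x80000000 else diff  # sign-extend

fwd = patch_bytes(0x1000, 0x1230)   # target after the call site
bwd = patch_bytes(0x1000, 0x0F80)   # target before the call site
assert decode(fwd) == 0x230
assert decode(bwd) == -0x80
```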
4. Leftovers Cleanup
We take the opportunity during post-compilation to clean up any traces that could reveal information about our obfuscation. Firstly, we target the export directory. Since we used exported symbols to identify dispatcher functions during processing, these entries remain visible in the final binary and could provide hints to analysts about our obfuscation structure. The script addresses this by completely clearing dispatcher function names and target function identifiers from the binary's metadata.
Secondly, we eliminate the indirect-jump leftovers within the dispatcher functions themselves. Recall that we used the `lea` + `jmp` placeholders to prevent branch folding during compilation. After patching the target functions, these placeholder instructions in the dispatchers are no longer needed and have to be removed.
def nop_indirect_jmp(pe, dispatch):
    # Find and overwrite lea + jmp instructions in dispatcher functions
    # with random bytes to further obfuscate the binary
    dispatch_func = pe.get_data(dispatch)
    for i in md.disasm(dispatch_func, 0):
        if i.mnemonic == "lea" and i.operands[0].reg == X86_REG_RAX:
            # Found the placeholder sequence
            bytes_to_nop = lea_size + jmp_size
            pe.set_bytes_at_rva(dispatch + lea_addr, random.randbytes(bytes_to_nop))
Here's how the binary looks before, during, and at the end of the obfuscation process.
Here we go again
After successfully developing and testing this solution, we discovered a simple yet crucial fact about inline assembly directives such as `.word` that forced us to rework our design: although these directives do not accept runtime parameters, they do accept local labels.
For example:
__attribute__((naked))
void f() {
    __asm volatile (
        ".long (1f)\n\t"
        "1:\n\t"
        "ret\n\t"
        :
        :
        :
    );
}
The address of the compile-time-resolved label `1` is emitted by the `.long` directive as a 32-bit value, right where the return instruction is. Looking at the disassembled code confirms this.
0000000000400450 <f>:
400450: 54 push rsp
400451: 04 40 add al,0x40
400453: 00 c3 add bl,al
We can see the address `0x400454` of the `ret` instruction (`c3`), encoded as little endian, and the disassembler's incorrect interpretation of those bytes as instructions.
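A one-liner confirms the byte sequence shown in the listing:

```python
import struct

# .long emits the label's address 0x400454 as 4 little-endian bytes,
# exactly the 54 04 40 00 that precede the c3 (ret) opcode above.
emitted = struct.pack("<I", 0x400454)
assert emitted == bytes([0x54, 0x04, 0x40, 0x00])
```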
In addition to a single label, these directives also accept expressions over labels, such as addition or subtraction:
.long (2f - 1f)
1:
...
2:
This allowed us to create a more elegant yet robust displacement calculation mechanism. By inserting a dedicated label after each dispatcher's call and at the beginning of the subsequent block, we can let the compiler compute the distance.
%entry:
%2 = alloca i64, align 8
call void @dispatch0()
call void asm sideeffect ".long (((2f - 1f) + 4) ^ 63277)
1:"
br label %bb
%bb: ; preds = %entry
call void asm sideeffect "2:"
store i64 0, i64* %2, align 8
call void @dispatch1()
call void asm sideeffect ".long (((4f - 3f) + 4) ^ 28877)
3:"
br label %bb2
...
As can be seen, we again make use of `InlineAsm` to emit both the labels and the `.long` directive. Each of the function's basic blocks will then have two labels: one referencing its start and another referencing the return address.
In order to point to the return address, the displacement must be increased by 4, which corresponds to the size of the `.long` output. In addition, the previously generated dispatcher's encoding key is reused to encode the displacement.
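A quick layout check in Python, with invented addresses and key, shows why the `+ 4` makes the decoded value land exactly on the successor block:

```python
# Hypothetical layout for one ".long (((2f - 1f) + 4) ^ key)" site.
key = 0x3B6D                 # invented per-dispatcher key
ret_addr = 0x140001100       # saved return address: points at the .long itself
start_lbl = ret_addr + 4     # label "1:" sits right after the 4 emitted bytes
next_lbl = 0x1400011F0       # label "2:" at the start of the successor block

encoded = ((next_lbl - start_lbl) + 4) ^ key   # value the directive emits
# At run time the dispatcher XORs with the same key and adds the result
# to the saved return address, landing on the successor block.
assert ret_addr + (encoded ^ key) == next_lbl
```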
The `.long` inline asm string is computed by combining each of these elements into a single string passed as a parameter to the `InlineAsm` call.
ss << ".long ((("
<< end_ref // Successor's starting label
<< " - " // Subtraction
<< start_ref // Current BB's ending label
<< ") + 4) ^ " // .long size and XOR operation
<< std::hex << xor_key // Encoding key
<< ")\n\t"
<< start_label << ":\n\t"; // Current BB's ending label
Where `end_ref` and `start_ref` reference, respectively, the successor's starting label and the current basic block's ending label, with the character `f` or `b` appended. In fact, a reference to a local label needs the suffix `f` or `b` depending on whether the referenced label comes after (forward) or before (backward) the current instruction.
This posed a problem whenever the blocks were shuffled because we needed to know whether the reference was to a subsequent block or to a preceding one. We had two choices:
- Assign labels and then shuffle
- Shuffle and then assign labels
Although the second option was more tempting, it required traversing the control flow graph to determine the real successor, since the basic block order would no longer be representative. With the first approach, it is sufficient to keep track of:
- the basic block labels
- the current basic block order
This allows us to determine whether the successor is before or after the current block based on its label value.
BasicBlock *next_bb = Term->getSuccessor(0);
auto current_pos = std::distance(ShuffledOrder.begin(), current_it);
auto next_pos = std::distance(ShuffledOrder.begin(),
    std::find(ShuffledOrder.begin(), ShuffledOrder.end(), next_bb));
bool is_forward_reference = (next_pos > current_pos);
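The suffix selection itself reduces to a position comparison, sketched here in Python with invented block names and label numbers:

```python
# GNU as local-label references need an 'f' (forward) or 'b' (backward)
# suffix, decided by the blocks' positions in the shuffled storage order.
def label_ref(label_num, current_pos, target_pos):
    suffix = "f" if target_pos > current_pos else "b"
    return f"{label_num}{suffix}"

shuffled = ["bb2", "entry", "bb1"]           # storage order after shuffling
pos = {bb: i for i, bb in enumerate(shuffled)}
# entry's successor bb1 now sits after it; bb1's successor bb2 sits before it
assert label_ref(2, pos["entry"], pos["bb1"]) == "2f"
assert label_ref(4, pos["bb1"], pos["bb2"]) == "4b"
```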
Following is a simplified extract of the generated assembly:
f: # @f
# %bb.0:
sub rsp, 104
call dispatch0
.long ((.Ltmp0-.Ltmp1)+4)^15221
.Ltmp1:
# %bb.1: # %split
.Ltmp0:
mov edi, offset aStr1
call strlen
mov qword ptr [rsp + 80], rax # 8-byte Spill
call dispatch2
.long ((.Ltmp2-.Ltmp3)+4)^1542
.Ltmp3:
As can be seen, and as will be discussed later, despite assigning each IR instruction its own dedicated block and protecting each block with a dispatcher, the compiled output contains more than one instruction per basic block. That aside, this solution was preferable because it removes the need for a post-compilation stage to calculate the displacements.
Import Protection
As previously shown, the API import resolution is performed through `LoadLibrary` and `GetProcAddress`, which respectively load the given library and retrieve a pointer to the desired API function.
The encryption routine is only needed at compile time, during API obfuscation. However, the decryption routine must be present in the final binary for the import resolution routine to work correctly.
To accomplish this, we had more than one option:
- Create new functions in the current LLVM bitcode and implement them using the available API for creating IR instructions (long and tedious)
- Create a separate file that will be compiled into bitcode, then parse it using LLVM's `parseIRFile` or `parseIR` to import the functions (need to handle external references)
- Add a step to the toolchain to `llvm-link` the separate bitcode file containing said functions with the current bitcode
We chose the third option because it relieves us of the burden of importing the dependencies of the selected functions (external functions, variables, structures, etc.), as it practically merges the two bitcodes' contents into a single one.
Encryption scheme
The encryption scheme (and consequently the decryption) works as follows:
- The first four bytes are extracted from a randomly generated number. These act only as a seed for the LCG and are not ciphertext
- $X_{n+1}$ is computed as $(a \cdot X_n + c) \bmod 2^{32}$, where $X_n$ is the seed value
- The first four bytes of $X_{n+1}$ are summed to get the XOR-ing key
- Input character is XOR-ed with the XOR key and result is appended to the output array
- Repeat the process from step 2
When the string terminator is reached, the sum of the first 4 bytes of $X_{n+1}$ is appended to the output string to signal string termination during the decryption phase. This tells the decryption routine when to append the terminating character `\0` and end the process.
The decryption process, which takes place at execution time, uses a function that performs the reverse of the one just described.
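The steps above can be sketched end to end in Python; the constant `c`, the seed, and the byte-sized key derivation (low byte of the sum) are assumptions made for illustration:

```python
import struct

A = 17                # multiplier: fixed across samples
C = 0x1B0F5A7D        # sample-specific constant (invented here)

def lcg_next(x):
    return (A * x + C) & 0xFFFFFFFF

def byte_sum(x):
    # Low byte of the sum of the state's four bytes (assumption)
    return sum(struct.pack("<I", x)) & 0xFF

def imp_encrypt(s, seed=0xCAFEBABE):
    out = bytearray(struct.pack("<I", seed))  # step 1: seed travels in the clear
    x = seed
    for ch in s.encode():
        x = lcg_next(x)                       # step 2: advance the LCG
        out.append(ch ^ byte_sum(x))          # steps 3-4: derive key, XOR
    x = lcg_next(x)
    out.append(byte_sum(x))                   # terminator marker
    return bytes(out)

def imp_decrypt(buf):
    x = struct.unpack_from("<I", buf)[0]      # first 4 bytes re-seed the LCG
    out = bytearray()
    for b in buf[4:]:
        x = lcg_next(x)
        k = byte_sum(x)
        if b == k:                            # termination condition
            break
        out.append(b ^ k)
    return out.decode()

assert imp_decrypt(imp_encrypt("kernel32.dll")) == "kernel32.dll"
```

Note that a ciphertext byte can only collide with its key when the plaintext byte is zero, which cannot happen inside a C string, so the terminator check is unambiguous.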
Gathering imports
Finding the imported functions is as simple as iterating over the functions in the `Module` (which represents a source file, or translation unit) and looking for the ones tagged as `DLLImportStorageClass`, with the exception of `LoadLibrary` and `GetProcAddress`.
for (auto &F : M)
    if (F.isDeclaration())
        if (auto *GV = dyn_cast<GlobalValue>(&F))
            if (GV->getDLLStorageClass() == GlobalValue::DLLImportStorageClass)
Unfortunately, LLVM's intermediate representation carries no information about which DLL exports a given function, so in order to retrieve the DLL (and encrypt its name), an extra compilation is needed. A step is therefore added to the obfuscation pipeline in which the original source code is compiled to PE, and a Python script extracts and saves this information:
with open(importmap, "wb") as f:
    for desc in pe.DIRECTORY_ENTRY_IMPORT:
        dll = desc.dll.decode("utf-8")
        for imp in desc.imports:
            name = imp.name.decode("utf-8")
            f.write(f"{name}:{dll}\n".encode("utf-8"))
Creating obf_imp_t
Once all the API import information is available, the next step is to create a dedicated structure for each import. The `obf_imp_t` structure definition is already present, thanks to the previous `llvm-link` step, so we only need to retrieve it within the module.
StructType *ObfImpT = StructType::getTypeByName(M.getContext(), "struct.obf_imp_t");
assert(ObfImpT && "obf_imp_t struct type not found");
For each import, both its DLL and API names are encrypted using the encryption function, which is accessible in the current LLVM pass module through the header included from the separate encryption routine file:
auto *Fname = F->getName().data();
const char *EncAPIName = ImpEncryptStr(Fname);
const char *EncDllName = ImpEncryptStr(ImportMap[Fname]);
A `GlobalVariable` (a variable that exists at module scope, corresponding to global or static variables in source languages) of type `ArrayType` of 32 `char`, initialized with the encrypted values, is created for each of them.
ArrayType *EncArrayType = ArrayType::get(i8Type, 32);
GlobalVariable *EncDllNameGV = new GlobalVariable(
M,
EncArrayType,
...
ConstantDataArray::get(Context, ArrayRef<char>(EncDllName, 32)),
);
GlobalVariable *EncAPINameGV = new GlobalVariable(
M,
EncArrayType,
...
ConstantDataArray::get(Context, ArrayRef<char>(EncAPIName, 32)),
);
A third zero-initialized global variable is created to store the address of the API once it is resolved.
Once all members of the `obf_imp_t` struct have been created as shown above, we can initialize a new variable of this type. However, since we need to store their Relative Virtual Addresses rather than the values directly, we use `getPtrToInt` to obtain their addresses for the initialization:
ConstantExpr::getPtrToInt(EncDllNameGV, i64Type)
ConstantExpr::getPtrToInt(EncAPINameGV, i64Type)
Emitting the stub
The dispatcher stub for Scatterbrain imports is a simple function whose only purpose is to load the `obf_imp_t` of a given import, call the resolver routine, and jump to the returned address.
All registers used are preserved by saving them before the call and restoring them afterward, ensuring that the arguments for the desired API, which are loaded into the registers, are properly passed from the stub to the actual API.
push rcx
lea rcx, [rip+obf_imp_t]
push rdx
push r8
push r9
sub rsp, 28h
call ObfImportResolver
add rsp, 28h
pop r9
pop r8
pop rdx
pop rcx
jmp rax
To emit the stub, we again use the `InlineAsm` feature, which allows us to issue inline assembly within the intermediate code and exactly replicate the structure of the stub dispatcher.
For each import, a `stub` function is created whose only instruction is a call to `InlineAsm`, parametric with respect to the data structure loaded via `lea` and to the resolver routine described earlier.
InlineAsm *Asm = InlineAsm::get(InlineAsmTy,
"push rcx\n\t"
"lea rcx, $0\n\t"
"push rdx\n\t"
"push r8\n\t"
"push r9\n\t"
"call ${1:P}\n\t"
"pop r9\n\t"
"pop r8\n\t"
"pop rdx\n\t"
"pop rcx\n\t"
"jmp rax"
);
CallInst *AsmCall = Builder.CreateCall(Asm, {Imp, Resolver});
The function is then marked `naked`, `noinline` and `optnone` to prevent the compiler from modifying it.
Finally, we need to replace the uses of the API with the stub we just created. To do this, we simply iterate over the users of each `Function` in the IR to identify the locations where it is used, and change the called operand of each `CallInst` to the stub function.
for (auto *U : Users) {
if (CallInst *CI = dyn_cast<CallInst>(U)) {
CI->setCalledOperand(Stub);
}
}
Results
The results were validated using the deobfuscator created by Mandiant, for both the import protection and the instruction dispatcher techniques. Both obfuscations were applied to individual sample functions, and a script very similar to the one available in the repository was used to selectively deobfuscate them by providing the protected functions' addresses.
As the output below shows, the tool is able to identify the numerous instruction dispatchers scattered throughout the binary and subsequently resolve and patch them, as well as the obfuscated imports, as confirmed by their stub code beneath.
[ProtectedImage64::INFO]: processing input:
Filepath: /tmp/scatterbrain.exe.patched
Basename: scatterbrain.exe
SHA256: 11AB5A3173B55442E6C1363EA6D8DEC6C66E618E0B2040E3598BF4BB4D049967
MD5: 02965DC4D68FBF0538AC5AD8D6258D42
Mode: ProtectionType.SELECTIVE
[ProtectedImage64::INFO]: assuming a .data section (with that exact name) exits
[ProtectedImage64::INFO]: .data section at +0x005000
[ProtectedImage64::INFO]: Starting instruction dispatcher recovery
[ProtectedImage64::INFO]: Found 2317 potential_dispatchers
Verifying further via emulation
[ProtectedImage64::INFO]: Recovered 2317 verified dispatchers
[ProtectedImage64::INFO]: Applying all resolved `jmp->dest` patches for each dispatcher
[ProtectedImage64::INFO]: Completed.
[ProtectedImage64::INFO]: Recovered 10 protected imports
[ProtectedImage64::INFO]: Recovered 1 protected functions.
[ProtectedImage64::INFO]: Initiating rebuild of deobfuscated binary result
[ProtectedImage64::INFO]: Successfully completed rebuilt container for deobfuscated binary
[ProtectedImage64::INFO]: Starting relocation rebuild given starting offset of 1000
[ProtectedImage64::INFO]: Applying fixups
[ProtectedImage64::INFO]: Completed rebuild of relocations
[ProtectedImage64::INFO]: Successfully created functional deobfuscated binary output
[ProtectedImage64::INFO]: Done
==========================================================================================
Below is a short video showing the IDA decompiler working on a binary before and after de-obfuscation to demonstrate the success of Scatterbrain's obfuscation emulation.
The video shows three binaries. The first is an obfuscated test case. As seen in the beginning, the disassembler attempts to interpret the initial displacement as an opcode, but fails and stops disassembling the rest of the function. The second binary is the original source code that was compiled for reference. The third binary shows what happens after running the Mandiant tool on the first binary. It can be seen that the disassembly has regained its meaning.
Limitations
As mentioned earlier, we encountered the structural limitations of intermediate-level code manipulation during our work. Each dispatcher ends up protecting not one instruction but several: the translation from IR instructions to machine instructions is not always 1:1, so even though every IR instruction was moved into its own basic block, each block ends up containing more than one machine instruction.
We believe that the multiple instances of `InlineAsm` and their opaque, non-IR code are responsible for the compiler's failure to optimize, which would otherwise have resulted in a more accurate translation.
For this reason, we explored the feasibility of a pass that operates at a level closer to the machine code, and luckily LLVM allows this thanks to `MachineFunction` passes, which operate on the instructions right before they are emitted into the final binary.
Unfortunately, there is no way to write MachineFunction passes out-of-source as there is for those running on IR, which forced us to recompile the LLVM llc tool each time the pass was changed.
The idea is to replicate what was done in the intermediate representation by inserting:
- A starting label for each basic block.
- A call to a dispatcher function (that was previously emitted).
- The `.long`-based displacement calculation.
We will spare the details, because the process is similar to the one performed at the IR level, except for one constraint: dispatcher functions cannot be emitted at this stage and must already be present in the bitcode. A solution is to emit them at the IR level, keep track of them externally (perhaps in a JSON file mapping each dispatcher to its encoding key), and use that information in the MachineFunction pass.
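A sketch of such bookkeeping could look like the following; the file layout and dispatcher names are purely illustrative, not taken from any existing tool:

```python
import json
import tempfile

# The IR pass records every dispatcher it emits together with its
# encoding key; the MachineFunction pass later reads the mapping back,
# since it cannot create the dispatcher functions itself.
dispatchers = {"dispatcher_0": 0xFAC0, "dispatcher_1": 0x1337}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(dispatchers, f)
    path = f.name

# Later, in the MachineFunction pass driver:
with open(path) as f:
    loaded = json.load(f)
```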
A snippet of the result (using a single dispatcher function and no encoding/shuffling) is shown below:
4004a8: c7 45 f0 00 00 00 00 mov DWORD PTR [rbp-0x10],0x0
4004af: e8 cc ff ff ff call 400480 <dispatcher>
4004b4: 04 00 add al,0x0
4004b6: 00 00 add BYTE PTR [rax],al
4004b8: 48 bf b8 11 40 00 00 movabs rdi,0x4011b8
4004bf: 00 00 00
4004c2: e8 b9 ff ff ff call 400480 <dispatcher>
4004c7: 04 00 add al,0x0
4004c9: 00 00 add BYTE PTR [rax],al
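As a sanity check on the listing above, the displacement handling can be modeled in a few lines of Python; `dispatch_target` is our illustrative helper, not part of any tool:

```python
def dispatch_target(ret_addr: int, disp_bytes: bytes, key: int = 0) -> int:
    # The dispatcher loads the sign-extended dword stored at the return
    # address, optionally decodes it with an xor key, and adds it to the
    # return address saved on the stack.
    disp = int.from_bytes(disp_bytes[:4], "little", signed=True)
    return (ret_addr + (disp ^ key)) & ((1 << 64) - 1)

# The call at 0x4004af returns to 0x4004b4, where the raw dword
# 04 00 00 00 is stored; with no encoding the real successor is
# 0x4004b4 + 4 = 0x4004b8, the movabs instruction.
assert dispatch_target(0x4004B4, b"\x04\x00\x00\x00") == 0x4004B8
```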
Although we stopped at this point, having already spent far too much time, we can safely conclude that this is the most faithful way to achieve the effect, and it is definitely one possible way in which APT41 may have implemented the instruction dispatcher.
Some goodies
We spent as much time reversing the available APT41 samples as we did analyzing the de-obfuscation tool, in order to better understand how certain details were handled. Thanks to this effort, we identified weaknesses that can be exploited to defeat the tool.
Instruction dispatcher
The de-obfuscator relies on a brute-force search for any `e8` (near `call`) opcodes. The potential dispatchers found are then verified by looking for the `pushfq`/`popfq` instruction pair.
calls = _brute_find_all_calls(d.imgbuffer, d.DATA_SECTION_EA)
potential_dispatchers: list[int] = []
for call_off in calls:
    try:
        if _verify_dispatcher_pushfq(d, call_off):
            potential_dispatchers.append(call_off)
    except Exception as _:
        continue
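For reference, `_brute_find_all_calls` essentially reduces to a linear byte scan; the following is our own sketch of that behavior, not Mandiant's code:

```python
def brute_find_all_calls(imgbuffer: bytes, data_section_ea: int) -> list[int]:
    # Every 0xE8 byte before the .data section is a potential near-call
    # opcode; false positives are weeded out later by the pushfq/popfq
    # verification step.
    return [off for off in range(min(data_section_ea, len(imgbuffer)))
            if imgbuffer[off] == 0xE8]

# nop / call -0x34 / nop / call +0
code = b"\x90\xE8\xCC\xFF\xFF\xFF\x90\xE8\x00\x00\x00\x00"
assert brute_find_all_calls(code, len(code)) == [1, 7]
```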
By inspecting `_verify_dispatcher_pushfq` we can identify another constraint: dispatcher verification scans at most 15 instructions.
MAX_SCAN_RANGE = 15
while count < MAX_SCAN_RANGE:
Also, if any `call`, indirect `jmp`, or privileged instruction is encountered, the dispatcher is considered invalid.
if (
    x86.X86_GRP_CALL in instr.groups or
    x86.X86_GRP_PRIVILEGE in instr.groups or
    instr.is_jmp() and (instr.is_op1_reg or instr.is_op1_mem)
):
    return False
So, in order to prevent dispatcher de-obfuscation, any of the following is possible:
- Replace `pushfq`/`popfq` with an alternative
- Create a dispatcher longer than 15 instructions
- Insert a `call`/`jmp` in it
Alternative dispatcher
The `lahf`/`sahf` instructions (Load AH from Flags / Store AH into Flags) transfer the low byte of the flags word to and from the AH register. The bits (msb to lsb) are: sign, zero, indeterminate, auxiliary carry, indeterminate, parity, indeterminate, and carry.
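A tiny helper (ours, for illustration) makes the layout concrete; note that the overflow flag is nowhere in this byte:

```python
def lahf_byte(sf: int, zf: int, af: int, pf: int, cf: int) -> int:
    # Byte captured by lahf, bit 7 down to bit 0: SF ZF 0 AF 0 PF 1 CF
    # (bit 1 of RFLAGS is always set, bits 3 and 5 are always clear).
    return (sf << 7) | (zf << 6) | (af << 4) | (pf << 2) | (1 << 1) | cf

# Sign and carry set, everything else clear: 1000'0011b
assert lahf_byte(sf=1, zf=0, af=0, pf=0, cf=1) == 0x83
```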
The overflow flag, used by `jo` jumps, is the only flag involved in conditional branches that `lahf` leaves out. It can be saved with `seto al`, but it is harder to restore: there is no easy way to set the overflow flag directly, so we have to trigger it manually if it was set.
Below is an updated version of the dispatcher that bypasses all three de-obfuscator verification heuristics.
It begins by saving most of the RFLAGS state: `lahf` stores the low flag byte in AH, while `seto al` captures the overflow flag state in AL's least significant bit.
Then it performs the dispatcher's main work with the XOR operation and stack manipulation.
The restoration process tests whether the original overflow flag was set using `test al, 1`. If the overflow flag wasn't originally set, the code skips directly to normal flag restoration. However, if the overflow flag was set, the code must manually trigger it.
The manual overflow recreation works by first storing 127 (`0x7F`), the largest signed one-byte number, in AL. Incrementing it to 128 (`0x80`) causes a signed byte overflow and naturally sets the OF flag.
The code then uses `sahf` to restore all other flags before cleaning up the registers.
push r15
mov r15, [rsp+8]
movsxd r15, dword ptr [r15]
; New code
push rax ; save rax content
lahf ; load low byte of RFLAGS into AH (rax bits 8-15)
seto al ; capture OF in AL (rax bit 0)
;
xor r15, 0xFAC0
add [rsp+10h], r15
; New code
test al, 1 ; check if OF was set
jz continue ; if not, no need to restore it
;
push rax ; save rax again (it currently holds the saved flags)
mov al, 0x7F ; move the biggest signed 1-byte number into AL
inc al ; increment it, causing a signed overflow that sets OF
pop rax ; restore rax with the saved flags
jmp continue
continue:
sahf ; restore other RFLAGS
pop rax ; restore original RAX
pop r15 ; restore current working register
ret
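The overflow-recreation trick above can be sanity-checked with a minimal 8-bit model (our own sketch):

```python
def inc8(value: int) -> tuple[int, bool]:
    # 8-bit increment returning (result, overflow_flag); OF is raised
    # when a non-negative signed byte flips to negative, which is
    # exactly what `mov al, 0x7F` / `inc al` exploits.
    result = (value + 1) & 0xFF
    overflow = (value & 0x80) == 0 and (result & 0x80) != 0
    return result, overflow

assert inc8(0x7F) == (0x80, True)   # 127 + 1 overflows a signed byte
assert inc8(0x41) == (0x42, False)  # ordinary increment, OF stays clear
```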
Following is the tool's output, which fails to identify any potential dispatchers in the same source code, now obfuscated with the new dispatcher variant.
[ProtectedImage64::INFO]: processing input:
Filepath: /tmp/scatterbrain.exe.patched
Basename: scatterbrain.exe
MD5: 5907DF34595D2531D87A38FD8C0D2D93
Mode: ProtectionType.SELECTIVE
[ProtectedImage64::INFO]: assuming a .data section (with that exact name) exits
[ProtectedImage64::INFO]: .data section at +0x007000
[ProtectedImage64::INFO]: Starting instruction dispatcher recovery
[ProtectedImage64::INFO]: Found 0 potential_dispatchers
Verifying further via emulation
[ProtectedImage64::INFO]: Recovered 0 verified dispatchers
Import protection
The de-obfuscator brute-forces the search for obfuscated import calls by looking for the `FF15` (indirect near `call`) and `FF25` (indirect near `jmp`) opcodes.
def _brute_find_impstubs(
    d: ProtectedInput64
) -> list[int]:
    """ Internal routine that implements a brute-force scanner to recover all
    possible import calls within a given imgbuffer by byte-signature scan.
    Post-processing is done afterwards to fully ensure these are valid.
    """
    patterns = [bytes.fromhex('FF15'), bytes.fromhex('FF25')]
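The scan itself amounts to a substring search over the image; a minimal stand-in (our code, not the tool's) behaves like this:

```python
def find_impstub_candidates(imgbuffer: bytes) -> list[int]:
    # Collect every offset of an FF15/FF25 signature; as in the
    # original, post-processing would still be needed to confirm hits.
    hits: list[int] = []
    for pattern in (bytes.fromhex("FF15"), bytes.fromhex("FF25")):
        off = imgbuffer.find(pattern)
        while off != -1:
            hits.append(off)
            off = imgbuffer.find(pattern, off + 1)
    return sorted(hits)

# mov rax, [rip+disp] / call cs:addr / jmp cs:addr
buf = bytes.fromhex("488B05BC5A0000FF159F590000FF2500100000")
assert find_impstub_candidates(buf) == [7, 13]
```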
To generate the `FF15` opcode, the operand of the `CallInst` must be a global variable holding the address of the target function.
%4 = load i32 (i8*, i32)*, i32 (i8*, i32)** @TerminateThread_stub, align 8
%5 = call i32 %4(i8* noundef %3, i32 noundef 0)
Which is translated to:
FF 15 9F 59 00 00 call cs:stub_addr
Interestingly, if the import is obfuscated within a function that is already protected by the instruction dispatcher, the translation will be more 'conservative'. In our opinion, this is again due to the presence of `InlineAsm`, which interferes with the compiler's heuristics.
48 8B 05 BC 5A 00 00 mov rax, cs:stub1_addr
31 D2 xor edx, edx
FF D0 call rax ; stub_1
This is enough to prevent the de-obfuscator from finding any imports in our binary.
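Since the scanner matches only those two byte signatures, the register-indirect sequence above never produces a hit, as a quick check confirms:

```python
# mov rax, cs:stub1_addr / xor edx, edx / call rax (from the listing above)
protected = bytes.fromhex("488B05BC5A000031D2FFD0")

# Neither import-stub signature appears anywhere in the sequence
for pattern in (bytes.fromhex("FF15"), bytes.fromhex("FF25")):
    assert pattern not in protected
```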
Conclusions
In this blog post, we proposed an implementation of Scatterbrain's obfuscation techniques in the context of adversary emulation, successfully reproduced them using LLVM, demonstrated their limitations and critical issues, validated our work with the Mandiant de-obfuscator, and proposed improvements to evade recovery by the automated tool.