vectorize logo transparent

Advancing Cybersecurity: Introduction to the Scudo Allocator

By Vectorize

Table of Contents

The scudo allocator is a modern memory allocator designed with a strong focus on security. The name, derived from the Latin word for "shield," reflects its main purpose of protecting software applications against vulnerabilities related to memory issues. Scudo offers improved memory safety guarantees and helps mitigate the risks associated with memory-based attacks—proving to be invaluable in cybersecurity for law enforcement.

However, compared to older allocators, scudo has received limited public vulnerability research and exploration. In this series of blog posts, our goal is to bridge this gap by providing an in-depth analysis of scudo‘s internal workings, examining strategies for exploit development, and introducing helpful tools that aid in this process.

The first post is an in depth practical look at the allocator, with an introduction to the source code. It is code-centric but tries to abstract where possible.

Terminology

The terminology can vary between allocators, so to avoid confusion, here are a basic few terms we will explore later in full:

  1. Chunk: The space of memory returned to the user from a malloc call.
  2. Block: A fundamental unit of memory in scudo, comprising chunk metadata and a user data. This user data space is what the pointer from a malloc call references.
  3. Region: A contiguous memory area that stores blocks of a specific size.
  4. Cache: Common in many allocators, caches enhance memory access speed and data locality for faster allocations.
  5. Guard Pages: Sequential memory pages set without permissions, triggering a segmentation fault upon any access attempt (read/write).
  6. Primary Allocator: The mechanism for managing smaller memory allocation requests.
  7. Secondary Allocator: The mechanism for handling larger memory allocation requests.
  8. Transfer Batch: A group of free chunks assembled for efficient movement between the cache and the primary allocator.
  9. Quarantine: A mechanism of protecting blocks from immediate use-after-free.

Here is an abstract simplification of the structure to keep in mind when reading on:

scudo basics

The scudo allocator has some features that distinguish it from other allocators, among them are:

  • Limited metadata on the heap, with only chunk headers present.
  • Region isolation using guard pages to separate memory regions from each other.
  • Heap randomization, shuffling blocks within each region.
  • Integrity checks integrated into each block’s header.
  • Delayed freelist to prevent immediate chunk reuse and avoid use-after-free vulnerabilities (disabled and not covered here).

It is important to note that scudo is heavily customisable, and many of these features can be tweaked or disabled in specific configurations. Therefore the specifics of this deep dive can vary between processes and devices. I will also only focus on the 64-bit version of scudo, to keep things simpler and more relevant.

scudo consists of two allocator strategies: the Primary Allocator and the Secondary Allocator. The Primary Allocator handles smaller allocations up to 64KB, while the Secondary Allocator handles larger ones in a similar fashion to known allocators (eg. dlmalloc) making use of the mmap syscall.

Primary

The Primary allocator in scudo is responsible for servicing smaller allocation sizes. It carves reserved memory regions into blocks of the same size and manages these. There are two Primary allocators implemented, one for 32-bit architectures and another for 64-bit architectures. The Primary allocator for 64-bit architectures will be the primary focus of this blogpost.

Secondary

The Secondary allocator in scudo is responsible for servicing large allocation requests (everything above 0x10010), it uses mmap under the hood. The Secondary allocator will not be the focus of this blog post because finding primitives in large allocations is comparatively rare to smaller size objects. A blogpost exploring the secondary allocator will be released in the future. It is worth noting that the security features of the secondary allocator are more lax than the primary, and it should not be discounted as a vector of exploitation.

The scudo Source

The scudo allocator is officially maintained by the LLVM-project. The scudo source is C++, and it is not exactly a page-turner. Compared to the rather straightforward internals of the glibc ptmalloc2 implementation, scudo is riddled with abstractions and C++ features. Here is the wrapper for malloc in the source:

INTERFACE WEAK void *SCUDO_PREFIX(malloc)(size_t size) {
  return scudo::setErrnoOnNull(SCUDO_ALLOCATOR.allocate(
      size, scudo::Chunk::Origin::Malloc, SCUDO_MALLOC_ALIGNMENT));
}

This code is pretty dense, but mostly self explanatory.

  • The malloc wrapper function serves as an interface to the scudo allocator, providing a layer of abstraction for memory allocation operations.
  • The size parameter represents the size of the memory chunk to be allocated.
  • The SCUDO_ALLOCATOR refers to the instance of the scudo allocator responsible for managing memory allocations.
  • scudo::Chunk::Origin::Malloc denotes that the memory allocation originates from a malloc call.
  • SCUDO_MALLOC_ALIGNMENT specifies the alignment requirements for the allocated memory.

If you are reading the scudo source it is worth keeping a few things in mind:

  • Most of the scudo code is present in header files.
  • scudo makes heavy use of C++ features (classes, inheritance, …).
  • The source is easy to compile, putting in e.g. informative print statements can be a help when playing around with it.

Building and using the allocator on desktop is as simple as:

cd $LLVM/compiler-rt/lib
clang++ -fPIC -std=c++17 -msse4.2 -O2 -pthread -shared \
  -I scudo/standalone/include \
  scudo/standalone/*.cpp \
  -o $HOME/libscudo.so

ss
# Then preload the lib
LD_PRELOAD=$HOME/libscudo.so ./a.out

Here is an overview of some of the key files:

  • local_cache.h – implements the local cache, this is where control will be early in an allocation path.
  • primaryXX.h – contains the primary allocator implementations (32/64), this is where you will find the region initialization code.
  • chunk.h – chunk header layout, checksum computation.
  • combined.h – commonalities in 64/32 here, handles MTE, and dispatches requests to the local cache.

Initialization

Like most allocators, initialization is done when a program makes its first request for memory. This is done to avoid unnecessary overhead for programs not using the heap, as well as removing the need for modifying the initialization routines of the linker.

For 64-bit instances the initialization routine is implemented in primary64.h.

// primary64.h
void init(s32 ReleaseToOsInterval) NO_THREAD_SAFETY_ANALYSIS {
    DCHECK(isAligned(reinterpret_cast<uptr>(this), alignof(ThisT)));

    const uptr PageSize = getPageSizeCached();
    const uptr GroupSize = (1U << GroupSizeLog);
    const uptr PagesInGroup = GroupSize / PageSize;
    const uptr MinSizeClass = getSizeByClassId(1);
// ...

    // Reserve the space required for the Primary.
    // static const uptr PrimarySize = RegionSize * NumClasses;
    CHECK(ReservedMemory.create(/*Addr=*/0U, PrimarySize,
                            "scudo:primary_reserve"));
// ...
    for (uptr I = 0; I < NumClasses; I++) {
        RegionInfo *Region = getRegionInfo(I);
      // The actual start of a region is offset by a random number of pages
      // when PrimaryEnableRandomOffset is set.
        Region->RegionBeg =
          (PrimaryBase + (I << Config::Primary::RegionSizeLog)) +
          (Config::Primary::EnableRandomOffset
           ? ((getRandomModN(&Seed, 16) + 1) * PageSize)
           : 0);
        Region->RandState = getRandomU32(&Seed);
      // ...

      Region->MemMapInfo.MemMap = ReservedMemory.dispatch(
          PrimaryBase + (I << Config::Primary::RegionSizeLog), RegionSize);
      CHECK(Region->MemMapInfo.MemMap.isAllocated());
    }
    shuffle(RegionInfoArray, NumClasses, &Seed);
//...
}

scudo will begin by reserving all the necessary space (but not commit it) with ReservedMemory.create(). So all of the primary allocator space is contained within the same mapping, and sections of it is are made read/writable, giving the effect of guard-pages.

Classes and Regions

The regions of the primary allocator can be observed in the program mappings (/proc/PID/maps):

...
0x7d826d5000       0x7d82715000 rw-p    40000      0 [anon:scudo:primary]
0x7d82715000       0x7d926d5000 ---p  ffc0000      0 [anon_7d82715]
0x7d926d5000       0x7d92715000 rw-p    40000      0 [anon:scudo:primary]
0x7d92715000       0x7db26d7000 ---p 1ffc2000      0 [anon_7d92715]
0x7db26d7000       0x7db2717000 rw-p    40000      0 [anon:scudo:primary]
0x7db2717000       0x7e626ce000 ---p affb7000      0 [anon_7db2717]
0x7e626ce000       0x7e6270e000 rw-p    40000      0 [anon:scudo:primary]
0x7e6270e000       0x7e826ca000 ---p 1ffbc000      0 [anon_7e6270e]
...

The regions are only visible when they are "touched", meaning you won’t necessarily see a region until it is used. They are also dynamically resized on the 64-bit version of scudo, meaning the 0x40000 size is the current size of the mapping, but it can grow to 1 << Config::Primary::RegionSizeLog in size (in 0x40000 increments). Half of these entries are the scudo regions, while the other half are guard pages (and uncommitted region space) put in place to mitigate overflows across classes.

The actual region beginning is offset by a random number of pages between 1 and 16:

Region->RegionBeg =
          (PrimaryBase + (I << Config::Primary::RegionSizeLog)) +
          (Config::Primary::EnableRandomOffset
           ? ((getRandomModN(&Seed, 16) + 1) * PageSize)
           : 0);

Each mapping will be at 0x10000000 increments (on android with Config::Primary::RegionSizeLog = 28) in addition to the 1-16 guard pages added.

The following table contains the default values for the size classes in the android version of scudo (size_class_map.h):

Class ID Size
1 0x20
2 0x30
3 0x40
4 0x50
5 0x60
6 0x70
7 0x90
8 0xb0
9 0xc0
10 0xe0
11 0x120
12 0x160
13 0x1c0
14 0x250
15 0x320
16 0x450
17 0x670
18 0x830
19 0xa10
20 0xc30
21 0x1010
22 0x1210
23 0x1bd0
24 0x2210
25 0x2d90
26 0x3790
27 0x4010
28 0x4810
29 0x5a10
30 0x7310
31 0x8210
32 0x10010

Region 0 is reserved for internal allocator use.

The region information is stored in the RegionInfoArray made up of RegionInfo structs.

In scudo version 17 and newer, the region list is shuffled, making the order of the regions random.

Chunk headers

Let us have a look at the chunk metadata stored in scudo chunks.
Due to alignment the chunk metadata does not take up all the bytes it could, but rather takes up 8 bytes, followed by 8 bytes of padding.
The (unpacked) chunk header is defined in chunk.h:

struct UnpackedHeader {
  uptr ClassId : 8;
  u8 State : 2;
  // Origin if State == Allocated, or WasZeroed otherwise.
  u8 OriginOrWasZeroed : 2;
  uptr SizeOrUnusedBytes : 20;
  uptr Offset : 16;
  uptr Checksum : 16;
};
  • ClassId – Is the ID of the class the chunk belongs to.
  • state – Is the current state of the block, allocated/freed/quarentined.
  • OriginOrWasZeroed – Contains an indication of what type of call allocated the chunk. If the chunk is not allocated the field denotes if the chunk has been zeroed.
  • SizeOrUnusedBytes – The requsted size of the chunk, if a reallocation is made on the chunk but the resulting size still fits within the same size class, then only this field is modified.
  • Offset – If the chunk does not start at the beginning of a block, what is the offset from the block start to the header (almost always 0).
  • Checksum – A two byte integrity checksum.

An example chunk header is printed here:

> dump(malloc(0x103).sub(0x10), 0x10)
             0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
738426ae60  0b 31 10 00 00 00 43 51 00 00 00 00 00 00 00 00  .1....CQ........

Here we see the 0xb class, a state of allocated, and a size of 0x103.

As an exercise to the reader, determine the state, chunk size and classid of the following header:

             0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
73c42632c0  0f 40 24 00 00 00 26 5f 00 00 00 00 00 00 00 00  .@$...&_........

Allocating

The allocate function, called by the malloc wrapper, is defined in combined.h, here are the general steps it takes:

  • If initialization has not been done, initialize the allocator for the current thread.
  • Adjust the size request to fit alignment.
  • Compute the ClassID corresponding to the size.
  • Call TSD->getCache().allocate(ClassId) for the found class.
    • If the cache allocation fails, the region is exhausted, the allocator will try the next region.
  • Manage Memory Tagging Extension (MTE) features, which are currently deactivated on most android systems.
  • Call the malloc hook if one is defined.
  • Create the chunk header, and return the user pointer.

The general flow can be seen below:

  NOINLINE void *allocate(uptr Size, Chunk::Origin Origin,
                          uptr Alignment = MinAlignment,
                          bool ZeroContents = false) NO_THREAD_SAFETY_ANALYSIS {
    initThreadMaybe();
    // ...

    // If the requested size happens to be 0 (more common than you might think),
    // allocate MinAlignment bytes on top of the header. Then add the extra
    // bytes required to fulfill the alignment requirements: we allocate enough
    // to be sure that there will be an address in the block that will satisfy
    // the alignment.
    const uptr NeededSize =
        roundUp(Size, MinAlignment) +
        ((Alignment > MinAlignment) ? Alignment : Chunk::getHeaderSize());

    // ...

    void *Block = nullptr;
    uptr ClassId = 0;
    // ...
    if (LIKELY(PrimaryT::canAllocate(NeededSize))) {
      ClassId = SizeClassMap::getClassIdBySize(NeededSize);
      // ...
      auto *TSD = TSDRegistry.getTSDAndLock(&UnlockRequired);
      TSD->assertLocked(/*BypassCheck=*/!UnlockRequired);
      Block = TSD->getCache().allocate(ClassId);
      // If the allocation failed, retry in each successively larger class until
      // it fits. If it fails to fit in the largest class, fallback to the
      // Secondary.
      if (UNLIKELY(!Block)) {
        while (ClassId < SizeClassMap::LargestClassId && !Block)
          Block = TSD->getCache().allocate(++ClassId);
        if (!Block)
          ClassId = 0;
      }
      if (UnlockRequired)
        TSD->unlock();
    }
    if (UNLIKELY(ClassId == 0)) {
      Block = Secondary.allocate(Options, Size, Alignment, &SecondaryBlockEnd,
                                 FillContents);
    }

    // ...

    const uptr BlockUptr = reinterpret_cast<uptr>(Block);
    const uptr UnalignedUserPtr = BlockUptr + Chunk::getHeaderSize();
    const uptr UserPtr = roundUp(UnalignedUserPtr, Alignment);

    /// ...

    return TaggedPtr;
  }

Of note in the above snippet is that the size can be zero, and this will result in the smallest possible allocation being returned.
It is also important to note that if a region is full, the next region is used. Meaning the binning is not "strict" as class 2 allocations can land in the class 3 region if the class 2 region is full.

The cache

To speed up allocations, and avoid lock congestion each thread is assigned a TSD. A TSD is a kind of "lock manager" for a local cache, and there is a one-to-one mapping between caches and TSDs (it is not specified what TSD stands for). Multiple threads can share a TSD, and they are assigned to each one in a round-robin fashion:

NOINLINE void initThread(Allocator *Instance) NO_THREAD_SAFETY_ANALYSIS {
    initOnceMaybe(Instance);
    // Initial context assignment is done in a plain round-robin fashion.
    const u32 Index = atomic_fetch_add(&CurrentIndex, 1U, memory_order_relaxed);
    setCurrentTSD(&TSDs[Index % NumberOfTSDs]);
    Instance->callPostInitCallback();
}

The pointer to the assigned TSD is stored in the thread local storage (TLS) of each thread.
The getCache() call on a TSD will then return the SizeClassAllocatorLocalCache (local_cache.h) attached to that TSD.
SizeClassAllocatorLocalCache, are per-TSD collections of free chunks that the primary allocator always draws upon. The cache is essentially a collection of PerClass objects (one for each class as the name suggests), with functions for allocation, deallocation, refilling and draining for them:

  struct alignas(SCUDO_CACHE_LINE_SIZE) PerClass {
    u16 Count;
    u16 MaxCount;
    // Note: ClassSize is zero for the transfer batch.
    uptr ClassSize;
    CompactPtrT Chunks[2 * SizeClassMap::MaxNumCachedHint];
  };

These are initialized at the same time as the primary allocator:

  NOINLINE void initCache() {
    for (uptr I = 0; I < NumClasses; I++) {
      PerClass *P = &PerClassArray[I];
      const uptr Size = SizeClassAllocator::getSizeByClassId(I);
      P->MaxCount = static_cast<u16>(2 * getMaxCached(Size));
      if (I != BatchClassId) {
        P->ClassSize = Size;
      } else {
        // ClassSize in this struct is only used for malloc/free stats, which
        // should only track user allocations, not internal movements.
        P->ClassSize = 0;
      }
    }
  }

The cache has a different size per class as can be seen above, the getMaxCached() computes the maxCached for a specific size:

static u16 getMaxCachedHint(uptr Size) {
    DCHECK_NE(Size, 0);
    u32 N;
    // Force a 32-bit division if the template parameters allow for it.
    if (Config::MaxBytesCachedLog > 31 || Config::MaxSizeLog > 31)
      N = static_cast<u32>((1UL << Config::MaxBytesCachedLog) / Size);
    else
      N = (1U << Config::MaxBytesCachedLog) / static_cast<u32>(Size);

    // Note that Config::MaxNumCachedHint is u16 so the result is guaranteed to
    // fit in u16.
    return static_cast<u16>(Max(1U, Min<u32>(Config::MaxNumCachedHint, N)));
}

Here are some examples of the maxCached value for different classes:

Class ID Max Cached
1 13
2 13
3 13
4 13
15 10
16 7

Cache allocation

The function looks like the following:

void *allocate(uptr ClassId) {
    DCHECK_LT(ClassId, NumClasses);
    PerClass *C = &PerClassArray[ClassId]; // [1]
    if (C->Count == 0) {
      if (UNLIKELY(!refill(C, ClassId))) // [2]
        return nullptr;
      DCHECK_GT(C->Count, 0);
    }
    // ...
    CompactPtrT CompactP = C->Chunks[--C->Count]; // [3]
    // ... Updating stats ...
    return Allocator->decompactPtr(ClassId, CompactP);
}

It retrieves the PerClass structure at [1] and checks the cache for chunks. Assuming there are chunks to fetch, it will pop one at [3] and return it. If it has no chunks it will call refill [2] to try and refill the local cache.

Finally Allocator->decompactPtr decompacts the pointers that are internally used by scudo. These are 32-bit representations of pointers based on offsets from a region’s base:

 // Defines the type and scale of a compact pointer. A compact pointer can
 // be understood as the offset of a pointer within the region it belongs
 // to, in increments of a power-of-2 scale.
 // eg: Ptr = Base + (CompactPtr << Scale).

If the refilling of the PerClass array fails, this means that the region is exhausted. No chunk is returned, and it is up to the logic in the combined allocator to decide what to do. Recall that the combined allocator will move to the next class if the cache allocation fails.
This means that given the primitive to allocate enough, it is possible to make malloc(0x10) (normally class 1), return a pointer to the class 2 region:

Allocated class 0x0b chunk :
             0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  
7647f11180  0b 01 10 00 00 00 90 3c 00 00 00 00 00 00 00 00
Proceeding to allocate a lot of class 0x0b chunks...
Now we are in class 0x0c, after 931156 allocations:
             0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  
7657f0f680  0c 01 10 00 00 00 75 96 00 00 00 00 00 00 00 00

Cache refill

Refilling a local cache is a way of requesting more chunks from the primary allocator. If the local cache instance is out of free chunks for the requested size, refill is called:

NOINLINE bool refill(PerClass *C, uptr ClassId) {
    initCacheMaybe(C);
    TransferBatch *B = Allocator->popBatch(this, ClassId);
    if (UNLIKELY(!B))
      return false;
    DCHECK_GT(B->getCount(), 0);
    C->Count = B->getCount();
    B->copyToArray(C->Chunks);
    B->clear();
    destroyBatch(ClassId, B);
    return true;
}

popBatch is called on the allocator (primary32/64) popping a TransferBatch from the region’s BatchList. If the list is empty, an attempt is made to refill it by creating new TransferBatch objects from the region, expanding the mapping if necessary.

TransferBatch

A TransferBatch efficiently stores batches of free chunks for a designated region in a compact format. Its primary function is to swiftly transfer and replenish the cache. The TransferBatch can accommodate up to MaxNumCached chunks, which is half the maximum capacity of a PerClass list. This design ensures that there is always available space in the cache to hold any newly freed chunks from the program.

The journey of a chunk can be seen here:

Creating new TransferBatch objects and refilling the Primary allocator is done in populateFreeListAndPopBatch, the number of batches prepared depends on the class size:

const u32 NumberOfBlocks =
    Min(MaxNumBatches * MaxCount,    // [1]
        static_cast<u32>((Region->MemMapInfo.MappedUser -
                          Region->MemMapInfo.AllocatedUser) /
                         Size));
DCHECK_GT(NumberOfBlocks, 0);

constexpr u32 ShuffleArraySize =
    MaxNumBatches * TransferBatch::MaxNumCached;
CompactPtrT ShuffleArray[ShuffleArraySize]; // [2]
DCHECK_LE(NumberOfBlocks, ShuffleArraySize);

const uptr CompactPtrBase = getCompactPtrBaseByClassId(ClassId);
uptr P = RegionBeg + Region->MemMapInfo.AllocatedUser;
for (u32 I = 0; I < NumberOfBlocks; I++, P += Size)
  ShuffleArray[I] = compactPtrInternal(CompactPtrBase, P); // [3]

// ...
// [4]
shuffle(ShuffleArray + NumberOfBlocks - N, N, &Region->RandState);  
pushBlocksImpl(C, ClassId, Region, &ShuffleArray[NumberOfBlocks - N], N,
             /*SameGroup=*/true);

TransferBatch *B = popBatchImpl(C, ClassId, Region); // [5]
  • First the number of blocks to prepare in batches is calculated, as either the MaxNumBatches * MaxCount or the rest of the region, whichever is smaller. [1]
  • An array of compacted pointers are then prepared, this will be the transfer batches. [2]
  • The compacted pointers to the region space are then written into the array. [3]
  • The array is shuffled, and pushed into the freelist of the primary allocator. [4]
  • Finally a transferbatch is popped and returned [5]

The shuffling here results in the randomization of chunk allocations seen from the malloc() callers perspective. This is because the TransferBatch object copies chunks into the localcache PerClass array by using a single memcpy:

template <class SizeClassAllocator> struct TransferBatch {
// ...
  void clear() { Count = 0; }
  void moveToArray(CompactPtrT *Array) {
    memcpy(Array, Batch, sizeof(Batch[0]) * Count);
    clear();
  }
// ...
private:
  CompactPtrT Batch[MaxNumCached];
  u16 Count;
};

This is called in the refill function, meaning the internal order of chunks in a TransferBatch is preserved into the localcache. Shuffling therefore only happens when a TransferBatch is refilled.

Freeing

Freeing a chunk in scudo is relatively simple:

  • If initialization has not been done, initialize a minimal config
  • The free hook is called.
  • Checks if the chunk is allocated – throws error if not (double free).
  • calls quarantineOrDeallocateChunk.

As the rather descriptive name suggests, quarantineOrDeallocateChunk will either put the chunk into the quarantine, or push it to the local cache freelist. In the case of a disabled or bypassed quarantine, the function is essentially performing this on the local cache PerClass:

    const bool NeedToDrainCache = C->Count == C->MaxCount;
    if (NeedToDrainCache)
      drain(C, ClassId);
    C->Chunks[C->Count++] = 
        Allocator->compactPtr(ClassId, reinterpret_cast<uptr>(P));

It is essentially the opposite of the allocate function except when the cache is full, in which case it is drained.

Draining the Cache

Draining a cache is the act of releasing memory from a PerClass, back into the primary allocator. This is done whenever a local cache freelist is full. As can be seen below, half of the PerClass array is drained.

  NOINLINE void drain(PerClass *C, uptr ClassId) {
    const u16 Count = Min(static_cast<u16>(C->MaxCount / 2), C->Count);
    Allocator->pushBlocks(this, ClassId, &C->Chunks[0], Count);
    // u16 will be promoted to int by arithmetic type conversion.
    C->Count = static_cast<u16>(C->Count - Count);
    for (u16 I = 0; I < C->Count; I++)
      C->Chunks[I] = C->Chunks[I + Count];
  }

Here the chunks are pushed back to the primary allocator using TransferBatch objects as well.

Header checksum

The checksum is by default a CRC32 checksum, but scudo also supports a bsd checksum. The implementations of the checksum computation function can be found in computeChecksum:

u32 Crc = static_cast<u32>(CRC32_INTRINSIC(Seed, Value));
for (uptr I = 0; I < ArraySize; I++)
    Crc = static_cast<u32>(CRC32_INTRINSIC(Crc, Array[I]));
return static_cast<u16>(Crc ^ (Crc >> 16));

The CRC32_INTRINSIC uses processor specific instructions to speed up CRC computation.
Since the checksum uses only 2 bytes, so the 4 byte output of CRC32 is truncated by xor’ing the upper 2 bytes with the lower 2.

The checksum is only checked for validity when the chunk is free’d, reallocated, or malloc_usable_size is called on the pointer:

[SM G980F::alloc_playground ]->let source = malloc(0x60)
let alloc = null
for (let i = 0; i < 0x40; i++) {
  alloc = malloc(0x60)

  if (alloc.sub(0x70).equals(source)) {
    break
  }
}

var byteArray = new Array(0xc0).fill(0x41)
Memory.writeByteArray(source, byteArray) // No crash

free(alloc) // Crash

Error: abort was called
    at <eval> (<input>:17)

The Quarantine

A quarantine is also part of the scudo allocator, but as previously mentioned it is turned off.
The documentation has the following to say about the quarantine:

[The quarantine] offers a way to delay the deallocation operations, preventing blocks to be immediately available for reuse. Blocks held will be recycled once certain size criteria are reached. This is essentially a delayed freelist which can help mitigate some use-after-free situations. This feature is fairly costly in terms of performance and memory footprint, is mostly controlled by runtime options and is disabled by default

This gives a suggestion to the reason for it being disabled, given its performance implications.

Essentially, the quarantine will hold chunks for a certain time after they have been freed, making them unavailable for reallocation until the quarantine is full, in which case it will recycle the chunks in it back to the primary allocator.

Because of the complexity of the quarantine, and the fact that it is disabled by default, we will defer discussing it until it is introduced.

So what does it mean for exploits?

First off it means that exploitation is harder, quite a bit harder. Better primitives are needed to exploit the same bugs.

Binning

One immediate effect of the binned allocations is that allocations of different sizes will never exist next to each other:

Size 0x20 alloc at 0x7b36c841f0
Size 0x90 alloc at 0x7b96c84320
Size 0x20 alloc at 0x7b36c84130
Size 0x90 alloc at 0x7b96c836c0
Size 0x20 alloc at 0x7b36c84310
Size 0x90 alloc at 0x7b96c83cf0
Size 0x20 alloc at 0x7b36c848b0
Size 0x90 alloc at 0x7b96c82dd0

For example a buffer overflow targeting a function pointer in an object of a different size than the overflown one will not work. While this is not a new concept, it will be a new concept to people only familiar with dlmalloc and derivatives.

State

The allocator keeps a state field in the header, indicating whether the chunk is allocated or free, checking this field upon freeing and allocating removes the possibility for double frees
unless a primitive allows to overwrite a header and maintain a correct checksum. That is a topic for a different post.

Randomization

This is a big hurdle when attacking scudo, in practice it means that you cannot know where chunks land in relation to each-other. Two malloc‘ed chunks will not be placed next to each other if the freelist is empty (unlike in dlmalloc).

The randomization of the order of regions makes cross region attacks hard as guessing the offset between specific regions becomes infeasible. We can still target cross regions, but we will not know what region we land in. Given a leak from an unbusy class, we can hit the next bin up by adding 0x10010000 (next region offset plus 16 pages). This is guaranteed to be a valid pointer to the next class.

If the class is not busy, the used region space will be smaller. Thus it used to generally hold that a heap pointer added with 0x10010000 will always hit the next class if the targeted class is busier (has more used space), than the region of the leaked pointer.

This is of course no longer the case as our leaked pointer could be in the very last region, meaning we would hit invalid memory. It is also hard to tell the business of the adjacent regions, if we do not know what class they belong to.

Determinism in local cache

In scudo a sequence of malloc and free will be deterministic insofar as the sequence is shorter than the PerClass->MaxCount. Since when chunks are drained, they might be used to refill a different thread’s cache.

malloc(0x50): 0x6e8c85d8b0
malloc(0x50): 0x6e8ca1df90
malloc(0x50): 0x6e8c8b8690
malloc(0x50): 0x6e8c79af10
malloc(0x50): 0x6e8c944550
malloc(0x50): 0x6e8c9ea210
 FREEING 
malloc(0x50): 0x6e8c85d8b0
malloc(0x50): 0x6e8ca1df90
malloc(0x50): 0x6e8c8b8690
malloc(0x50): 0x6e8c79af10
malloc(0x50): 0x6e8c944550
malloc(0x50): 0x6e8c9ea210

Chunk header checksums

Unlike its older brother jemalloc, scudo does keep heap metadata on the heap, and while this usually means more corruption targets, the checksumming makes it difficult to attack the allocator itself.
There are a few ways to defeat the checksums when using an overflow:

  1. Skipping the header, if your overflow primitive is good enough, skipping past the header would defeat the mitigation.
  2. Leaking, given a linear out-of-bounds read (OOB), one could write the leaked header back.
  3. Avoiding a free(), if corrupted objects are never freed the corruption will not be detected.
  4. Recomputing checksum, with a leak of a header one could recover the random seed used for checksums. Retrieving a valid seed is relatively time-consuming.

Conclusions

The scudo allocator is simple but proves effective at hardening the heap against attacks. It succeeds in increasing the resources needed to find primitives that defeat its different mitigations, advancing your overall cybersecurity.

In the next post, we’ll go deeper into offensive cybersecurity and provide more training. We’ll look at tooling around scudo and present a practical exploit against the allocator.

Resources

[1] https://un1fuzz.github.io/articles/scudo_internals.html
[2] https://trenchant.io/scudo-hardened-allocator-unofficial-internals-documentation/
[3] https://www.synacktiv.com/publications/behind-the-shield-unmasking-scudos-defenses

By Vectorize

Share Article