Memory Allocators

Think of memory as a large hotel. Your program checks in guests (objects) and checks them out (frees them) constantly. The allocator is the front desk: it assigns rooms, tracks which rooms are vacant, and tries to avoid leaving awkward gaps that are too small for any new guest. A bad front desk causes long check-in lines; a great one handles thousands of guests per second without you noticing.

In computing, every call to malloc, new, or make goes through an allocator. The allocator decides which bytes to give you, how to track that you have them, and how to reclaim them when you call free. The algorithm it uses determines whether that call takes 10 nanoseconds or 10 microseconds — a difference that compounds across millions of allocations.

How a Process Gets Memory

Before the allocator, the OS.

Your process starts with a virtual address space — a flat range of addresses (typically 48-bit on x86-64, so 128 TiB). This is virtual, not physical: the OS maps it to RAM pages on demand. The allocator lives in user-space and manages a region of this address space called the heap.

Virtual Address Space (x86-64 process)
┌─────────────────────────────────────────────┐  0xFFFFFFFFFFFF
│  Kernel space (not accessible)              │
├─────────────────────────────────────────────┤  0x7FFFFFFFFFFF
│  Stack  (grows ↓, fixed thread limit ~8MB)  │
├─────────────────────────────────────────────┤
│  mmap region (shared libs, large mallocs)   │
├─────────────────────────────────────────────┤
│  Heap   (grows ↑ via brk/mmap syscalls)     │
├─────────────────────────────────────────────┤
│  BSS / Data / Text (code + globals)         │
└─────────────────────────────────────────────┘  0x000000000000

When the heap needs to grow, the allocator asks the OS for more pages via brk() (extends the heap contiguously) or mmap() (maps a new anonymous region). These are expensive syscalls — hundreds of nanoseconds each. A good allocator batches these requests, grabs large chunks from the OS, and subdivides them internally to serve individual malloc calls cheaply.

The Core Problem: Fragmentation

The allocator's enemy is fragmentation: free memory that cannot be used because it is split into chunks too small or scattered to satisfy a request.

Heap after several alloc/free cycles:
┌──────────────────────────────────────────────────────────┐
│ [USED 64B] [FREE 8B] [USED 128B] [FREE 12B] [USED 32B]  │
└──────────────────────────────────────────────────────────┘
  Total free: 20 bytes. Request for 16 bytes: FAILS.

Two types:

External fragmentation: free space is spread across many small non-contiguous gaps.
Internal fragmentation: a 17-byte request gets a 32-byte slot (next power of two), wasting 15 bytes inside the allocation.

Every allocator algorithm is a different answer to the fragmentation trade-off.

Algorithm 1: Free Lists (glibc ptmalloc2)

The classic design. Maintain a linked list of free chunks. Each chunk has a header recording its size and whether it is free.

Chunk layout (glibc):
┌──────────────────────────────────────┐
│  prev_size (8B, if prev chunk free)  │
│  size + flags (8B)                   │  ← CHUNK HEADER
│  fd (forward ptr, when free)         │
│  bk (back ptr, when free)            │
│                                      │
│  user data (returned to caller)      │
│                                      │
└──────────────────────────────────────┘

On malloc(n): search the free list for a chunk ≥ n. Split it if it is much larger. On free(p): read the chunk header, insert back into the free list, coalesce with adjacent free chunks to reduce fragmentation.

Bins for speed: glibc organizes free lists into size-class buckets called bins. Small requests (≤ 512B on 64-bit) go to fast bins (per-size-class LIFO stacks, no coalescing — very fast). Larger requests go to small bins or large bins (sorted free lists searched with a best-fit strategy). Huge allocations (≥ 128KB by default) bypass the heap and call mmap directly, then munmap on free.

The multithreading problem: glibc uses one arena (heap + metadata) per thread (up to 8× CPU cores), so threads rarely contend. But metadata operations still require per-arena locks. Under high concurrency, lock contention becomes the bottleneck.

Algorithm 2: Slab / Size-Class Allocators (jemalloc, tcmalloc, mimalloc)

Modern allocators abandon the unified free list in favor of size classes: pre-defined buckets (e.g., 8, 16, 32, 48, 64, 80, 96… bytes) where every allocation rounds up to the next class. Each size class gets its own pool of same-sized chunks called slabs (or spans).

Size-class pools (jemalloc sketch):
  [8B  slabs]: [slot][slot][slot]…  ← 100% utilization possible, zero header
  [16B slabs]: [slot][slot][slot]…
  [32B slabs]: [slot][slot][slot]…
  ...
  [Large]:     separate mmap-backed extent per allocation

Benefits:

No header per allocation for small objects: size is implicit from which slab the address falls in.
No search: malloc(12) rounds to 16 → pop from the 16B free list. O(1), no scanning.
Low internal fragmentation: tight size classes (jemalloc uses ~232 size classes) waste ≤ ~25% on average.

Thread-local caches (TCaches): each thread holds a private cache of recently freed chunks per size class. malloc and free hit only thread-local state — zero lock contention on the fast path. When the cache overflows, chunks are returned to the shared arena in bulk.

malloc(32) fast path:
  1. Read thread-local tcache[32B].top          ← ~3 cycles
  2. Return pointer, decrement tcache count      ← ~2 cycles
  Total: ~5 cycles = ~2 ns on modern hardware

malloc(32) slow path (cache miss):
  3. Lock arena, refill tcache from slab         ← ~50–200 cycles

jemalloc (used by Firefox, Facebook, FreeBSD libc) and mimalloc (Microsoft, used in .NET) are the two implementations most worth knowing. Benchmarks consistently show them 10–50% faster than glibc on allocation-heavy workloads. They are drop-in replacements: LD_PRELOAD=libjemalloc.so ./myapp.

Algorithm 3: Bump-Pointer Allocators (Arenas)

The fastest possible allocator. Maintain a single pointer into a contiguous block of memory. Allocate by bumping the pointer forward.

Arena (bump-pointer):
┌────────────────────────────────────────────────┐
│  [obj A][obj B][obj C][obj D]  →  free          │
└────────────────────────────────────────────────┘
                              ↑ bump ptr

malloc(n): ptr += align(n); return old_ptr;   // 2–3 instructions
free(p):   no-op.                              // completely free

free is a no-op. The entire arena is released in one shot when the work unit completes. There is no per-object tracking.

When to use it: request-scoped allocations in a web server (allocate freely during the request, reset the arena at the end), parse trees (allocate nodes during parsing, discard the whole tree when done), game frames (allocate render objects for a frame, reset at frame end).

The trade-off: you cannot free individual objects. If your lifetimes are mixed, you leak memory until the arena resets.

The Extreme Case: No Allocation at All

Some systems take arena thinking to its logical conclusion and pre-allocate all memory at startup. There is no malloc after initialization. No free. No allocator overhead. No GC.

TigerBeetle: Static Memory as a Design Constraint

TigerBeetle is a financial-grade database written in Zig. It defines its maximum memory footprint at startup — a fixed buffer for every data structure the system will ever use: the LSM tree's table blocks, the WAL, client reply buffers, hash map slots, and every queue.

TigerBeetle startup:
  1. Read config: grid_size, accounts_max, transfers_max, ...
  2. Allocate one large buffer from the OS
  3. Partition it into named regions of fixed capacity
  4. Never call allocate() again

Runtime:
  new_account()  → write into pre-allocated accounts[] at next index
  commit_batch() → write into WAL ring buffer (fixed slots)
  cache_lookup() → read from pre-allocated cache[]

  No malloc. No free. No GC pauses. Ever.

This is sometimes called static allocation or arena-at-startup. The OS hands over memory once; the application subdivides and reuses it for its entire lifetime.

Advantages of Fixed Memory

Mathematically predictable latency. When there is no allocator in the hot path, there is no allocator jitter. A commit_batch() call that takes 800µs will always take roughly 800µs. P99 ≈ P50. This is the property financial systems, real-time control, and game engines require — and it is impossible to guarantee with a general-purpose allocator.

No GC pauses. Because no objects are dynamically allocated, there is nothing for a garbage collector to trace. TigerBeetle is written in Zig, which has no GC, and its design makes even a hypothetical GC irrelevant.

No memory fragmentation. Fragmentation is a property of a dynamic allocator subdividing a heap over time. If the heap is pre-partitioned into fixed regions that are never subdivided, fragmentation cannot occur.

Simpler failure model. Memory exhaustion is detected at startup, not during a transaction at 3 AM. If the machine does not have enough RAM for the configured capacity, TigerBeetle exits immediately with a clear error — not an OOM kill mid-operation.

Better cache behavior. Pre-allocated arrays of fixed-size objects are cache-friendly by construction. A 64-byte struct array lays out contiguously in memory; a heap of pointer-chased malloc'd objects does not.

No allocator lock contention. Under high concurrency, even jemalloc's thread-local caches occasionally hit the per-arena lock. Static memory has no such path.

Disadvantages of Fixed Memory

Rigid capacity. The system cannot grow beyond the parameters set at startup. If you configure TigerBeetle for 10 million accounts and you reach 10 million, you cannot create account 10,000,001 without restarting with a larger configuration. This requires capacity planning — an operational discipline many teams lack.

Memory is reserved, not consumed. A TigerBeetle instance configured for 100 GB will map 100 GB of virtual address space at startup even if 90% of those slots are empty. Physical RAM is committed on access (demand paging), but reserved virtual address space can matter in shared environments or on 32-bit systems.

Upfront design complexity. Every data structure must have a bounded maximum size decided before a line of business logic is written. This requires knowing your workload deeply. A general allocator lets you defer that decision; static allocation forces it.

Not universally applicable. This model works for systems with well-defined, stable workloads: a database with a known maximum number of accounts, a real-time engine with a fixed number of entities per frame, a network packet processor with bounded queue depth. It does not work for a general-purpose web server that must handle arbitrary request payloads of unknown size.

One size is always wrong. Conservative configuration wastes RAM. Aggressive configuration causes startup failure. Getting the right number requires load testing and headroom analysis — ongoing work for the lifetime of the system.

Comparison at a Glance

Property	glibc malloc	jemalloc	Arena (reset)	Fixed at startup
Allocation latency	~50–200 ns	~5–50 ns	~1–5 ns	0 ns (no alloc)
Free latency	~50–200 ns	~5–50 ns	0 ns (batch)	0 ns (no free)
Fragmentation	High (external)	Low	None	None
GC pressure	Depends on runtime	Depends	Low	Zero
Max heap	Dynamic	Dynamic	Arena size	Fixed at startup
Suitable for	General use	General use	Short-lived batches	Bounded workloads

What Architects Need to Know

Swap the allocator before writing custom pools. Replacing glibc with jemalloc or mimalloc is a one-line change (LD_PRELOAD or a link-time flag) that frequently delivers 10–20% throughput improvement in allocation-heavy services. Profile first (heaptrack, massif, pprof), then swap.

Measure allocation rates. Services allocating > 100 MB/s are spending measurable CPU on the allocator. That rate is sustainable with jemalloc; it is a red flag with glibc. At > 1 GB/s you should investigate whether arenas or object pools make sense for the hot path.

Short-lived allocations belong in arenas. HTTP request parsing, SQL query planning, JSON deserialization — these create objects that all die at the same time. An arena allocator reduces those to a pointer bump plus a single reset, eliminating both allocation overhead and GC pressure.

Consider static allocation for latency-critical systems. If you are building a trading engine, a real-time simulation, or a financial ledger and you need P99 latency guarantees, the TigerBeetle model is worth the operational discipline it requires. The latency you buy is not a tuning parameter — it is a structural property of the design.

The allocator is not magic. It can only work with the memory access patterns your code creates. Pointer-chased linked lists will miss the cache regardless of which allocator you use. Layout your data structures for sequential access; the allocator determines overhead, not cache behavior.

References

Source	Notes
Berger et al., 2000 — Hoard: A Scalable Memory Allocator for Multithreaded Applications	Foundation for modern per-thread arenas
Evans, 2006 — A Scalable Concurrent malloc(3) Implementation for FreeBSD	jemalloc original paper
Leijen et al., 2019 — Mimalloc: Free List Sharding in Action	mimalloc design; Microsoft Research
TigerBeetle source — main.zig, vsr.zig	Static allocation in production
Drepper, 2007 — What Every Programmer Should Know About Memory	CPU cache, NUMA, and allocator interaction