A

autochitect

Explore software architecture ◆

Garbage Collection

2026-05-22InternalsRuntimeMemoryPerformance

How runtimes automatically reclaim memory, the algorithmic trade-offs between throughput and pause time, and why GC behavior should shape your service architecture.

Garbage Collection

Imagine you rent a storage unit and fill it with boxes over the course of a year. Some boxes you still use. Some you forgot about in February. A garbage collector is a service that periodically walks through your unit, figures out which boxes you can no longer reach, and clears out the space for reuse — without you having to track every box yourself.

In computing, every object your program creates occupies memory on the heap. Without automatic reclamation, you either leak memory (boxes pile up forever) or manually free allocations (you must track every box and never lose the receipt). Garbage collection automates reclamation: the runtime finds unreachable objects and reuses their memory, so you can write new Order() without writing the matching delete.

John McCarthy invented it for LISP in 1960. Today it underpins Java, Go, Python, JavaScript, C#, Ruby, and nearly every high-level language — but the algorithms differ in ways that produce dramatically different performance characteristics.

How Memory Actually Works

Before the algorithms make sense, you need the physical picture.

Your process gets virtual memory from the OS, divided into regions. The stack grows and shrinks automatically with function call depth — local variables live and die with their stack frame. The heap is the general-purpose pool: dynamic, long-lived objects go here, and the runtime manages it.

The heap is divided into a metadata layer (which objects exist, how big they are) and raw byte ranges. The allocator hands out slices; the collector reclaims them. The challenge is doing both without stopping your application or fragmenting memory into unusable gaps.

Virtual Address Space
┌──────────────────────────────────────────────────────────┐
│  Code (read-only)                                        │
├──────────────────────────────────────────────────────────┤
│  Globals / BSS                                           │
├──────────────────────────────────────────────────────────┤
│  Heap  ↓ grows down                                      │
│  ┌────────────────────────────────────────────────────┐  │
│  │  [live obj] [live obj] [DEAD] [live obj] [DEAD]    │  │  ← fragmented
│  └────────────────────────────────────────────────────┘  │
├──────────────────────────────────────────────────────────┤
│  Stack ↑ grows up                                        │
└──────────────────────────────────────────────────────────┘

Fragmentation is the core enemy. A heap can have plenty of free bytes total but still fail to allocate a 1MB object if no contiguous 1MB block exists. GC algorithms manage not just reachability but physical layout.

The Root Set: Where Reachability Begins

Every GC algorithm starts from the same premise: an object is live if it is reachable from any root. Roots are:

  • Local variables and parameters on every thread's stack
  • Global and static variables
  • CPU registers (objects mid-computation)
  • JNI / FFI handles (for languages with native interop)

An object reachable from a root is live. An object reachable from a live object is live. Everything else is garbage. The GC's job is to identify that partition accurately.

Algorithm 1: Mark and Sweep

The original algorithm. Two phases.

Mark phase: start from all roots, traverse the object graph, set a "live" bit on every reachable object. This is essentially a depth-first search over heap pointers.

Sweep phase: scan the entire heap linearly. Any object without the live bit is garbage — return its memory to the free list.

Roots → [A] → [B] → [D]
             ↘ [C]

[E] ← unreachable

Mark: A✓ B✓ C✓ D✓   E✗
Sweep: free E's memory

Cost: O(live objects) for mark, O(heap size) for sweep. The sweep is expensive because you scan the whole heap regardless of how much is dead.

The stop-the-world problem: if your application keeps creating new objects while you're traversing the graph, a new pointer might appear behind the mark cursor — you'd miss it and free a live object. The naive solution is to freeze all application threads for the duration. This is a stop-the-world (STW) pause. For a 4GB JVM heap with millions of objects, this pause can be hundreds of milliseconds.

Algorithm 2: Reference Counting

Instead of a batch trace, maintain a counter on every object: the number of references pointing to it. When a reference is created, increment. When dropped, decrement. When the count hits zero, the object is immediately dead — free it.

x = MyObject()   # refcount = 1
y = x            # refcount = 2
del x            # refcount = 1
del y            # refcount = 0 → freed immediately

Advantages: no STW pauses; memory is freed the instant it becomes unreachable; predictable, incremental cost.

The cycle problem: two objects can reference each other while nothing else references either of them.

[A] → [B]
[B] → [A]
# nothing points to A or B from roots
# both refcounts = 1 forever
# both leak

Python and Swift use reference counting as their primary algorithm but add a separate cyclic garbage collector — a periodic tracer that finds reference cycles. This is why Python has both sys.getrefcount() and the gc module.

Cost per write: every pointer assignment touches two counters. On multi-core hardware this requires atomic increments/decrements, which are 10–100× more expensive than a plain store. This is why reference counting has poor throughput when write rates are high.

Algorithm 3: Generational Collection

The generational hypothesis: most objects die young. Profiling across many programs shows that ~90% of allocated objects become garbage before their first GC cycle. This insight enables a huge optimization.

Divide the heap into generations:

  • Young/Nursery generation: new allocations go here. It's small (8–64MB typically). Collect it frequently (every few milliseconds). Since most objects are dead, collection is fast.
  • Old/Tenured generation: objects that survive multiple young-gen collections are promoted here. Collect it infrequently (every few seconds or minutes). It's larger and more expensive to scan.
Allocation path:
   new Object() → Young Gen (fast bump-pointer allocation)
       ↓ survives N collections
   Promoted → Old Gen
       ↓ survives longer
   Promoted → Permanent / Large Object Space

Bump-pointer allocation is why young-gen allocation is nearly free: maintain a pointer into a contiguous block; each allocation just moves the pointer forward by the object's size. No free-list search. Allocation cost ≈ 3 instructions.

The cost of frequent young-gen collection stays low because you only trace live objects in the young gen — but you need remembered sets to track old-gen pointers into the young gen (otherwise you'd have to scan the old gen to find roots into the young gen, defeating the purpose). This requires write barriers: every pointer store that crosses generations executes a small snippet of code to update the remembered set.

Concurrent and Incremental GC

Modern runtimes solve the STW problem by running GC concurrently with the application.

Incremental GC: do a small slice of marking work, then return control to the application, then do another slice. Spreads pause time across many small pauses (e.g., 5ms × 20 = 100ms total work, but no single 100ms pause).

Concurrent marking: run the mark phase on background threads while the application runs. The problem: the application can modify the object graph while you're marking it. A live object could become unreachable (it's safe to miss that — a floating garbage is collected next cycle) but a previously-grey object could have a new pointer to an unvisited white object added — missing that would free a live object.

Tri-color invariant solves this. Objects are one of three colors:

  • White: not yet visited (initially garbage)
  • Grey: visited, children not yet scanned
  • Black: visited, all children scanned

The invariant: no black object may point directly to a white object. Maintain this and you can pause only briefly at the start (initial mark) and end (remark) while doing the bulk of marking concurrently.

Initial state: all white
Roots → grey queue
Concurrent: greys → scan children → black; add unvisited children to grey
STW remark: find any objects mutator turned grey during concurrent phase
Sweep: all remaining white objects are garbage

Write barriers enforce the invariant during mutation. When the application writes a pointer (A.field = B), a write barrier fires. Different algorithms use different barrier types:

  • Dijkstra (insertion barrier): when you add an edge A→B, shade B grey
  • Yuasa (deletion barrier): when you remove an edge A→B, shade B grey
  • SATB (snapshot-at-the-beginning): used by JVM G1 — record the old value before overwriting

Real-World GC Implementations

JVM: G1, ZGC, Shenandoah

The JVM offers multiple GC algorithms because no single algorithm wins everywhere.

G1 (Garbage First, default since JDK 9): divides the heap into equal-sized regions (1–32MB). Young and old regions are interleaved, not contiguous. Prioritizes regions with the most garbage first (hence the name). Concurrent marking + STW evacuation. Targets 200ms pause times by default; tunable with -XX:MaxGCPauseMillis.

ZGC (JDK 15+): fully concurrent — marks and relocates concurrently. STW pauses < 1ms regardless of heap size. Achieves this via load barriers: every pointer read checks whether the object has been relocated and fixes up the pointer on the fly. Higher CPU overhead (~5–15%) for near-zero pauses.

Shenandoah: similar to ZGC, also concurrent evacuation. Maintained separately from ZGC with slightly different implementation trade-offs.

G1 pause targets:   ~50–200ms typical
ZGC pause targets:  <1ms (scales to multi-TB heaps)
Serial GC:          full STW, throughput-focused, best for small heaps

Go: Concurrent Tricolor

Go's GC is a concurrent tricolor mark-sweep. It runs almost entirely on goroutine stacks alongside application code. Key properties:

  • STW pauses < 1ms (often 100–500µs) since Go 1.14
  • Runs two GC phases per cycle; triggers based on heap growth (GOGC, default 100 — collect when heap doubles)
  • No generational GC (as of Go 1.21) — a deliberate trade-off: simpler implementation, predictable pauses, but higher allocation cost for short-lived objects compared to generational collectors
  • Write barrier: hybrid Dijkstra+Yuasa barrier enabled only during mark phase

Python: Reference Counting + Cyclic Collector

CPython uses reference counting as the primary mechanism — objects are freed immediately when refcount hits zero. A separate generational cyclic collector runs periodically to catch reference cycles. Three generations; runs less frequently for longer-lived objects.

The GIL (Global Interpreter Lock) means only one thread executes Python bytecode at a time, which simplifies the refcount implementation (no atomic operations needed for most increments/decrements) but limits concurrency.

.NET: Background Server GC

.NET's GC is generational (gen0, gen1, gen2) with a Large Object Heap (LOH, > 85KB allocations). In server GC mode (default for ASP.NET), there is one heap and one GC thread per logical CPU core — collections run in parallel across all cores, dramatically improving throughput for multi-core workloads. Background GC runs gen2 collection concurrently while foreground gen0/gen1 collections may still pause briefly.

The Latency/Throughput Spectrum

Every GC algorithm is a point on a spectrum:

High Throughput ←────────────────────────────→ Low Latency

Serial GC    G1 GC    Shenandoah    ZGC    Reference Counting
(batch jobs) (default) (medium SLA) (real-time) (Swift/ObjC UI)

Throughput GC (serial, parallel): maximize objects processed per second. Long pauses are acceptable. Best for batch processing, Hadoop jobs, offline analytics.

Low-latency GC (ZGC, Shenandoah): minimize pause time at the cost of higher CPU usage (concurrent GC threads compete with application threads). Best for APIs with P99 SLAs, trading systems, real-time analytics.

Reference counting: near-zero incremental pauses, but high write overhead and cycle problem. Best for environments where immediate deallocation matters (UI frameworks, Rust's Arc<T>).

What Architects Need to Know

GC is not just a runtime detail — it's an architecture constraint.

GC pauses affect your SLA. A 200ms G1 pause during peak load breaks a 100ms P99 SLA. Design around it: use low-latency GC, tune heap sizes to reduce collection frequency, or route latency-sensitive traffic to Go or C++ services.

Object churn is the enemy. Every short-lived object the young gen must collect costs CPU. High-throughput Java services often spend 20–30% of CPU in GC. Use object pools, value types, or off-heap buffers (Java's ByteBuffer.allocateDirect(), sun.misc.Unsafe) for hot paths.

Heap sizing has a sweet spot. A heap that is too small triggers frequent collections. A heap that is too large means each old-gen collection scans more memory — and on G1, more regions to evacuate. Rule of thumb: keep live data set ≤ 50% of heap to give the collector room to work.

GC pressure shows up in unexpected places. JSON deserialization, string concatenation, and logging frameworks are all heavy object allocators. Profile allocation rates (async-profiler, Go's pprof, .NET's dotMemory) before optimizing.

Off-heap and arena allocation bypass GC entirely. For large binary data (images, packets, ML tensors), allocating off-heap means the GC never scans it. Libraries like Chronicle Map, RocksDB's block cache, and PyTorch tensors do this explicitly. The trade-off: you manage the lifetime manually, reintroducing the possibility of leaks.

GC Tuning Quick Reference

Runtime Key Flag What It Controls
JVM G1 -XX:MaxGCPauseMillis=200 Target pause time
JVM G1 -Xms / -Xmx Initial / max heap
JVM ZGC -XX:+UseZGC Enable ZGC
Go GOGC=100 Trigger ratio (% heap growth before GC)
Go GOMEMLIMIT=512MiB Hard memory limit (Go 1.19+)
Python gc.disable() Disable cyclic collector (not refcount)
.NET <ServerGarbageCollection>true Enable server GC mode

References

Source Notes
McCarthy, 1960 — Recursive Functions of Symbolic Expressions Original mark-and-sweep in LISP
Wilson, 1992 — Uniprocessor Garbage Collection Techniques Survey of all major algorithms
Jones, Hosking, Moss — The Garbage Collection Handbook (2011) Definitive reference
JEP 439 — Generational ZGC (JDK 21) ZGC with generational support
Go GC Guide — go.dev/doc/gc-guide Go runtime GC tuning
Dijkstra et al., 1978 — On-the-fly Garbage Collection Tri-color concurrent marking