Garbage Collection
How runtimes automatically reclaim memory, the algorithmic trade-offs between throughput and pause time, and why GC behavior should shape your service architecture.
Garbage Collection
Imagine you rent a storage unit and fill it with boxes over the course of a year. Some boxes you still use. Some you forgot about in February. A garbage collector is a service that periodically walks through your unit, figures out which boxes you can no longer reach, and clears out the space for reuse — without you having to track every box yourself.
In computing, every object your program creates occupies memory on the heap. Without automatic reclamation, you either leak memory (boxes pile up forever) or manually free allocations (you must track every box and never lose the receipt). Garbage collection automates reclamation: the runtime finds unreachable objects and reuses their memory, so you can write new Order() without writing the matching delete.
John McCarthy invented it for LISP in 1960. Today it underpins Java, Go, Python, JavaScript, C#, Ruby, and nearly every high-level language — but the algorithms differ in ways that produce dramatically different performance characteristics.
How Memory Actually Works
Before the algorithms make sense, you need the physical picture.
Your process gets virtual memory from the OS, divided into regions. The stack grows and shrinks automatically with function call depth — local variables live and die with their stack frame. The heap is the general-purpose pool: dynamic, long-lived objects go here, and the runtime manages it.
The heap is divided into a metadata layer (which objects exist, how big they are) and raw byte ranges. The allocator hands out slices; the collector reclaims them. The challenge is doing both without stopping your application or fragmenting memory into unusable gaps.
Virtual Address Space
┌──────────────────────────────────────────────────────────┐
│ Code (read-only) │
├──────────────────────────────────────────────────────────┤
│ Globals / BSS │
├──────────────────────────────────────────────────────────┤
│ Heap ↓ grows down │
│ ┌────────────────────────────────────────────────────┐ │
│ │ [live obj] [live obj] [DEAD] [live obj] [DEAD] │ │ ← fragmented
│ └────────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────┤
│ Stack ↑ grows up │
└──────────────────────────────────────────────────────────┘
Fragmentation is the core enemy. A heap can have plenty of free bytes total but still fail to allocate a 1MB object if no contiguous 1MB block exists. GC algorithms manage not just reachability but physical layout.
The Root Set: Where Reachability Begins
Every GC algorithm starts from the same premise: an object is live if it is reachable from any root. Roots are:
- Local variables and parameters on every thread's stack
- Global and static variables
- CPU registers (objects mid-computation)
- JNI / FFI handles (for languages with native interop)
An object reachable from a root is live. An object reachable from a live object is live. Everything else is garbage. The GC's job is to identify that partition accurately.
Algorithm 1: Mark and Sweep
The original algorithm. Two phases.
Mark phase: start from all roots, traverse the object graph, set a "live" bit on every reachable object. This is essentially a depth-first search over heap pointers.
Sweep phase: scan the entire heap linearly. Any object without the live bit is garbage — return its memory to the free list.
Roots → [A] → [B] → [D]
↘ [C]
[E] ← unreachable
Mark: A✓ B✓ C✓ D✓ E✗
Sweep: free E's memory
Cost: O(live objects) for mark, O(heap size) for sweep. The sweep is expensive because you scan the whole heap regardless of how much is dead.
The stop-the-world problem: if your application keeps creating new objects while you're traversing the graph, a new pointer might appear behind the mark cursor — you'd miss it and free a live object. The naive solution is to freeze all application threads for the duration. This is a stop-the-world (STW) pause. For a 4GB JVM heap with millions of objects, this pause can be hundreds of milliseconds.
Algorithm 2: Reference Counting
Instead of a batch trace, maintain a counter on every object: the number of references pointing to it. When a reference is created, increment. When dropped, decrement. When the count hits zero, the object is immediately dead — free it.
x = MyObject() # refcount = 1
y = x # refcount = 2
del x # refcount = 1
del y # refcount = 0 → freed immediately
Advantages: no STW pauses; memory is freed the instant it becomes unreachable; predictable, incremental cost.
The cycle problem: two objects can reference each other while nothing else references either of them.
[A] → [B]
[B] → [A]
# nothing points to A or B from roots
# both refcounts = 1 forever
# both leak
Python and Swift use reference counting as their primary algorithm but add a separate cyclic garbage collector — a periodic tracer that finds reference cycles. This is why Python has both sys.getrefcount() and the gc module.
Cost per write: every pointer assignment touches two counters. On multi-core hardware this requires atomic increments/decrements, which are 10–100× more expensive than a plain store. This is why reference counting has poor throughput when write rates are high.
Algorithm 3: Generational Collection
The generational hypothesis: most objects die young. Profiling across many programs shows that ~90% of allocated objects become garbage before their first GC cycle. This insight enables a huge optimization.
Divide the heap into generations:
- Young/Nursery generation: new allocations go here. It's small (8–64MB typically). Collect it frequently (every few milliseconds). Since most objects are dead, collection is fast.
- Old/Tenured generation: objects that survive multiple young-gen collections are promoted here. Collect it infrequently (every few seconds or minutes). It's larger and more expensive to scan.
Allocation path:
new Object() → Young Gen (fast bump-pointer allocation)
↓ survives N collections
Promoted → Old Gen
↓ survives longer
Promoted → Permanent / Large Object Space
Bump-pointer allocation is why young-gen allocation is nearly free: maintain a pointer into a contiguous block; each allocation just moves the pointer forward by the object's size. No free-list search. Allocation cost ≈ 3 instructions.
The cost of frequent young-gen collection stays low because you only trace live objects in the young gen — but you need remembered sets to track old-gen pointers into the young gen (otherwise you'd have to scan the old gen to find roots into the young gen, defeating the purpose). This requires write barriers: every pointer store that crosses generations executes a small snippet of code to update the remembered set.
Concurrent and Incremental GC
Modern runtimes solve the STW problem by running GC concurrently with the application.
Incremental GC: do a small slice of marking work, then return control to the application, then do another slice. Spreads pause time across many small pauses (e.g., 5ms × 20 = 100ms total work, but no single 100ms pause).
Concurrent marking: run the mark phase on background threads while the application runs. The problem: the application can modify the object graph while you're marking it. A live object could become unreachable (it's safe to miss that — a floating garbage is collected next cycle) but a previously-grey object could have a new pointer to an unvisited white object added — missing that would free a live object.
Tri-color invariant solves this. Objects are one of three colors:
- White: not yet visited (initially garbage)
- Grey: visited, children not yet scanned
- Black: visited, all children scanned
The invariant: no black object may point directly to a white object. Maintain this and you can pause only briefly at the start (initial mark) and end (remark) while doing the bulk of marking concurrently.
Initial state: all white
Roots → grey queue
Concurrent: greys → scan children → black; add unvisited children to grey
STW remark: find any objects mutator turned grey during concurrent phase
Sweep: all remaining white objects are garbage
Write barriers enforce the invariant during mutation. When the application writes a pointer (A.field = B), a write barrier fires. Different algorithms use different barrier types:
- Dijkstra (insertion barrier): when you add an edge A→B, shade B grey
- Yuasa (deletion barrier): when you remove an edge A→B, shade B grey
- SATB (snapshot-at-the-beginning): used by JVM G1 — record the old value before overwriting
Real-World GC Implementations
JVM: G1, ZGC, Shenandoah
The JVM offers multiple GC algorithms because no single algorithm wins everywhere.
G1 (Garbage First, default since JDK 9): divides the heap into equal-sized regions (1–32MB). Young and old regions are interleaved, not contiguous. Prioritizes regions with the most garbage first (hence the name). Concurrent marking + STW evacuation. Targets 200ms pause times by default; tunable with -XX:MaxGCPauseMillis.
ZGC (JDK 15+): fully concurrent — marks and relocates concurrently. STW pauses < 1ms regardless of heap size. Achieves this via load barriers: every pointer read checks whether the object has been relocated and fixes up the pointer on the fly. Higher CPU overhead (~5–15%) for near-zero pauses.
Shenandoah: similar to ZGC, also concurrent evacuation. Maintained separately from ZGC with slightly different implementation trade-offs.
G1 pause targets: ~50–200ms typical
ZGC pause targets: <1ms (scales to multi-TB heaps)
Serial GC: full STW, throughput-focused, best for small heaps
Go: Concurrent Tricolor
Go's GC is a concurrent tricolor mark-sweep. It runs almost entirely on goroutine stacks alongside application code. Key properties:
- STW pauses < 1ms (often 100–500µs) since Go 1.14
- Runs two GC phases per cycle; triggers based on heap growth (GOGC, default 100 — collect when heap doubles)
- No generational GC (as of Go 1.21) — a deliberate trade-off: simpler implementation, predictable pauses, but higher allocation cost for short-lived objects compared to generational collectors
- Write barrier: hybrid Dijkstra+Yuasa barrier enabled only during mark phase
Python: Reference Counting + Cyclic Collector
CPython uses reference counting as the primary mechanism — objects are freed immediately when refcount hits zero. A separate generational cyclic collector runs periodically to catch reference cycles. Three generations; runs less frequently for longer-lived objects.
The GIL (Global Interpreter Lock) means only one thread executes Python bytecode at a time, which simplifies the refcount implementation (no atomic operations needed for most increments/decrements) but limits concurrency.
.NET: Background Server GC
.NET's GC is generational (gen0, gen1, gen2) with a Large Object Heap (LOH, > 85KB allocations). In server GC mode (default for ASP.NET), there is one heap and one GC thread per logical CPU core — collections run in parallel across all cores, dramatically improving throughput for multi-core workloads. Background GC runs gen2 collection concurrently while foreground gen0/gen1 collections may still pause briefly.
The Latency/Throughput Spectrum
Every GC algorithm is a point on a spectrum:
High Throughput ←────────────────────────────→ Low Latency
Serial GC G1 GC Shenandoah ZGC Reference Counting
(batch jobs) (default) (medium SLA) (real-time) (Swift/ObjC UI)
Throughput GC (serial, parallel): maximize objects processed per second. Long pauses are acceptable. Best for batch processing, Hadoop jobs, offline analytics.
Low-latency GC (ZGC, Shenandoah): minimize pause time at the cost of higher CPU usage (concurrent GC threads compete with application threads). Best for APIs with P99 SLAs, trading systems, real-time analytics.
Reference counting: near-zero incremental pauses, but high write overhead and cycle problem. Best for environments where immediate deallocation matters (UI frameworks, Rust's Arc<T>).
What Architects Need to Know
GC is not just a runtime detail — it's an architecture constraint.
GC pauses affect your SLA. A 200ms G1 pause during peak load breaks a 100ms P99 SLA. Design around it: use low-latency GC, tune heap sizes to reduce collection frequency, or route latency-sensitive traffic to Go or C++ services.
Object churn is the enemy. Every short-lived object the young gen must collect costs CPU. High-throughput Java services often spend 20–30% of CPU in GC. Use object pools, value types, or off-heap buffers (Java's ByteBuffer.allocateDirect(), sun.misc.Unsafe) for hot paths.
Heap sizing has a sweet spot. A heap that is too small triggers frequent collections. A heap that is too large means each old-gen collection scans more memory — and on G1, more regions to evacuate. Rule of thumb: keep live data set ≤ 50% of heap to give the collector room to work.
GC pressure shows up in unexpected places. JSON deserialization, string concatenation, and logging frameworks are all heavy object allocators. Profile allocation rates (async-profiler, Go's pprof, .NET's dotMemory) before optimizing.
Off-heap and arena allocation bypass GC entirely. For large binary data (images, packets, ML tensors), allocating off-heap means the GC never scans it. Libraries like Chronicle Map, RocksDB's block cache, and PyTorch tensors do this explicitly. The trade-off: you manage the lifetime manually, reintroducing the possibility of leaks.
GC Tuning Quick Reference
| Runtime | Key Flag | What It Controls |
|---|---|---|
| JVM G1 | -XX:MaxGCPauseMillis=200 |
Target pause time |
| JVM G1 | -Xms / -Xmx |
Initial / max heap |
| JVM ZGC | -XX:+UseZGC |
Enable ZGC |
| Go | GOGC=100 |
Trigger ratio (% heap growth before GC) |
| Go | GOMEMLIMIT=512MiB |
Hard memory limit (Go 1.19+) |
| Python | gc.disable() |
Disable cyclic collector (not refcount) |
| .NET | <ServerGarbageCollection>true |
Enable server GC mode |
References
| Source | Notes |
|---|---|
| McCarthy, 1960 — Recursive Functions of Symbolic Expressions | Original mark-and-sweep in LISP |
| Wilson, 1992 — Uniprocessor Garbage Collection Techniques | Survey of all major algorithms |
| Jones, Hosking, Moss — The Garbage Collection Handbook (2011) | Definitive reference |
| JEP 439 — Generational ZGC (JDK 21) | ZGC with generational support |
| Go GC Guide — go.dev/doc/gc-guide | Go runtime GC tuning |
| Dijkstra et al., 1978 — On-the-fly Garbage Collection | Tri-color concurrent marking |