Storage Durability

Imagine you ask your bank to transfer $500. The transfer completes, you get a confirmation — and then the server loses power. Does the money move?

Durability is the guarantee that says yes: once a system reports a write as committed, that write survives any subsequent crash. It is the "D" in ACID, and it is one of the hardest properties to implement efficiently, because every fast layer between the application and stable media (the OS page cache, the device's own write cache) is volatile.

The Dangerous Gap: RAM vs Disk

Every write starts in RAM. When your code calls write(), the data lands in the OS page cache — volatile memory that is fast but disappears if the machine loses power or the kernel crashes.

Application
    │  write(fd, data, len)
    ▼
OS Page Cache  ◄── volatile, lost on crash
    │  (eventually, when it feels like it)
    ▼
Storage device ◄── persistent

The OS flushes dirty pages to disk on its own schedule — typically every 5–30 seconds, or when memory pressure demands it. That window is your data loss exposure. Any write acknowledged to the user but not yet flushed is gone if the machine dies.

fsync: The Durability Hammer

fsync(fd) tells the OS: "flush every dirty page for this file to the physical storage device, and don't return until the device confirms the data is stable."

write(fd, data, len);  // lands in page cache
fsync(fd);             // blocks until durable on device
// only NOW is the write safe to acknowledge

This works, but it's slow. fsync must:

1. Push dirty pages from the page cache to the device driver
2. Instruct the device to flush its internal write buffer to stable media
3. Wait for the device to confirm completion

Typical costs:

Device	fsync latency
Consumer SSD (NVMe)	50–200 µs
Enterprise SSD (NVMe, capacitor-backed)	10–50 µs
HDD (spinning, 7200 RPM)	5–20 ms
Cloud block volume (EBS, GCP PD)	1–5 ms (network round trip)

With a single committer that waits for each fsync, throughput is capped at 1/latency: roughly 50–200 TPS on a spinning disk, 5,000–20,000 TPS on consumer NVMe, and 20,000–100,000 TPS on enterprise NVMe.
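The ceiling is simple arithmetic: a committer that waits for every fsync can finish at most 1/latency commits per second. A minimal sketch of that calculation, where the function name `serial_tps` is illustrative and the latencies are the table's ballpark ranges rather than measurements:

```python
def serial_tps(fsync_latency_s: float) -> float:
    """Upper bound on commits/second when each commit waits for its own fsync."""
    return 1.0 / fsync_latency_s

# Ballpark latency ranges (fastest, slowest) in seconds, copied from the table.
devices = {
    "HDD 7200 RPM": (5e-3, 20e-3),
    "Consumer NVMe": (50e-6, 200e-6),
    "Enterprise NVMe": (10e-6, 50e-6),
    "Cloud block volume": (1e-3, 5e-3),
}

for name, (fastest, slowest) in devices.items():
    low, high = serial_tps(slowest), serial_tps(fastest)
    print(f"{name}: about {low:,.0f} to {high:,.0f} TPS")
```

The bound applies to one serial committer only; batching several commits behind one fsync raises it.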

The Write-Ahead Log (WAL)

Raw fsync per user write is too expensive. Databases solve this with a write-ahead log (WAL): a sequential, append-only file that records every intended change before the change is applied to the main data file.

Client write
    │
    ├─► Append record to WAL ──► fsync WAL ──► acknowledge client
    │
    └─► Apply to in-memory data structures (async, no fsync needed yet)
         │
         └─► Flush to main data file periodically (checkpoint)

Why does WAL help?

Sequential writes are fast: appending to a log avoids random I/O, which is catastrophic for HDDs and still meaningful for SSDs (write amplification).
One small flush per commit: only the WAL has to be fsynced before acknowledging; the main data files can be written lazily.
Crash recovery: on restart, replay any WAL records that weren't checkpointed. The main data file might be partially written, but the WAL is the source of truth.

PostgreSQL calls this WAL. MySQL/InnoDB calls it the redo log. RocksDB calls it the Write-Ahead Log. SQLite offers a WAL mode as an alternative to its default rollback journal, which is an undo log rather than a write-ahead log.
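The append, fsync, acknowledge sequence can be sketched in a few lines. This is a toy under stated assumptions, not how any of the databases above implement their log: the class name `TinyWAL`, the `[length][crc32][payload]` record layout, and the `key=value` payload are all invented for illustration.

```python
import os
import struct
import zlib

class TinyWAL:
    """Append-only log; each record is [length][crc32][payload]."""

    HEADER = struct.Struct("<II")  # payload length, CRC32 of payload

    def __init__(self, path: str):
        # O_APPEND makes every write land at the current end of the file.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.state = {}  # in-memory "main data"; a checkpoint would persist it

    def put(self, key: str, value: str) -> None:
        payload = f"{key}={value}".encode()
        record = self.HEADER.pack(len(payload), zlib.crc32(payload)) + payload
        os.write(self.fd, record)  # 1. append the intended change to the log
        os.fsync(self.fd)          # 2. block until the log record is durable
        self.state[key] = value    # 3. apply in memory; now safe to acknowledge

    def close(self) -> None:
        os.close(self.fd)
```

A production log would also loop on short writes and fsync the containing directory when the file is first created; both are left out here to keep the ordering visible.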

Group Commit

Fsyncing once per transaction is still expensive if you have thousands of concurrent writers. Group commit batches multiple transactions into a single fsync:

T1 write ──┐
T2 write ──┼──► accumulate in buffer ──► single fsync ──► notify T1, T2, T3
T3 write ──┘

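PostgreSQL exposes group-commit tuning through commit_delay and commit_siblings, and MySQL through binlog_group_commit_sync_delay for the binary log. synchronous_commit and innodb_flush_log_at_trx_commit control something different: whether a commit waits for the flush at all. If each fsync takes 10 ms and about 100 concurrent transactions share it, 100 TPS (serial fsync) becomes roughly 10,000 TPS on the same disk, at the cost of a little extra commit latency.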
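One common way to build group commit is a leader/follower scheme: whichever writer finds no fsync in flight becomes the leader and syncs on behalf of everyone who has appended so far. The sketch below is a minimal illustration under that assumption; `GroupCommitLog` and its fields are hypothetical names, and real engines add timers, size limits, and error handling.

```python
import os
import threading

class GroupCommitLog:
    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.cond = threading.Condition()
        self.written = 0      # sequence number of the last record appended
        self.synced = 0       # highest sequence number known to be durable
        self.syncing = False  # True while some thread is inside fsync
        self.fsync_calls = 0  # counter, only to show the batching effect

    def commit(self, record: bytes) -> None:
        with self.cond:
            os.write(self.fd, record)
            self.written += 1
            my_seq = self.written
        while True:
            with self.cond:
                if self.synced >= my_seq:
                    return                 # an fsync already covered this record
                if self.syncing:
                    self.cond.wait()       # follower: wait for the leader to finish
                    continue
                self.syncing = True        # leader: sync everything appended so far
                target = self.written
            os.fsync(self.fd)              # outside the lock, so others keep appending
            with self.cond:
                self.synced = max(self.synced, target)
                self.syncing = False
                self.fsync_calls += 1
                self.cond.notify_all()

    def close(self) -> None:
        os.close(self.fd)
```

Records appended while an fsync is in flight are not covered by it; their writers wait, and one of them leads the next fsync. Under concurrent load `fsync_calls` ends up well below the number of commits, which is the whole point.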

Hardware Write Caches and the Lie Underneath

Storage devices have their own DRAM write cache. If that cache is volatile (not protected against power loss), the device may report "write complete" before the data reaches stable media.

fsync instructs the device to flush this cache (using the ATA FLUSH CACHE command or NVMe Flush). But:

Some consumer SSDs have been observed acknowledging a flush before the data is actually on stable media.
RAID controllers with a battery-backed cache (BBU) can safely acknowledge writes before platter commit; that's fine, because the battery covers power loss.
Cloud block volumes implement durability at the storage backend; the device flush semantics are virtualised.

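Always verify whether your storage device (or cloud volume) honours flush commands if you need real durability guarantees. PostgreSQL's pg_test_fsync tool reports how many syncs per second each sync method achieves; a rate far above what the medium can physically sustain suggests that flushes are not reaching stable media.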
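A rough plausibility check can be scripted. The sketch below times repeated write-plus-fsync cycles on one block and offers a rotational yardstick for spinning disks. It is a heuristic under stated assumptions, not a substitute for pg_test_fsync: the function names are invented, and the rotation bound only means something for an HDD whose cache is genuinely flushed.

```python
import os
import time

def rotation_bound(rpm: int) -> float:
    """Rotations per second: a rough ceiling on serial flushes to one spot on an HDD."""
    return rpm / 60.0

def measure_fsync_rate(path: str, rounds: int = 200) -> float:
    """Rewrite the same 512-byte block and fsync it; return syncs per second."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(rounds):
            os.lseek(fd, 0, os.SEEK_SET)   # always overwrite the first block
            os.write(fd, b"x" * 512)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return rounds / elapsed
```

If `measure_fsync_rate` on a 7200 RPM disk reports thousands of syncs per second while `rotation_bound(7200)` is 120, something between the application and the platter is acknowledging early.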

O_DIRECT and Bypassing the Page Cache

Some databases (InnoDB, Oracle) use O_DIRECT to bypass the OS page cache entirely and manage their own buffer pool:

Application buffer pool  ──O_DIRECT──►  Storage device
                                (no page cache in between)

Advantages:

No double-buffering (OS cache + DB cache)
Predictable memory usage
Write ordering is under DB control

Disadvantages:

Requires aligned I/O (512-byte or 4096-byte boundaries)
Loses OS read-ahead benefits for sequential scans

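O_DIRECT alone is not a durability guarantee: it skips the page cache, but not the device's write cache or pending file-metadata updates. Databases that use it therefore pair it with O_SYNC/O_DSYNC or an explicit fsync for durable writes.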
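The alignment requirement is the part that most often trips people up. Here is a Linux-only sketch, assuming a 4096-byte block size (real code should query the device) and using an anonymous mmap as an easy source of page-aligned memory; `align_up` and `write_direct` are illustrative names.

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size; query the device in real code

def align_up(n: int, block: int = BLOCK) -> int:
    """Round n up to the next multiple of block."""
    return (n + block - 1) // block * block

def write_direct(path: str, payload: bytes) -> int:
    """Write payload with O_DIRECT, zero-padded to a whole number of blocks."""
    size = align_up(max(len(payload), 1))
    buf = mmap.mmap(-1, size)              # anonymous mmap memory is page-aligned
    buf.write(payload)                     # the rest of the buffer stays zeroed
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT  # Linux-only flag
    fd = os.open(path, flags, 0o644)
    try:
        written = os.write(fd, buf)        # offset 0 and length are block multiples
        os.fsync(fd)                       # still needed: the device cache is not bypassed
    finally:
        os.close(fd)
        buf.close()
    return written
```

Some filesystems reject O_DIRECT outright, so callers should expect `OSError` on open or write and fall back to buffered I/O.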

Durability Levels: The Spectrum

Not every system needs full per-write durability. Here's the practical spectrum:

Level	Mechanism	Data loss on crash	Typical use
None	`write()` only	Up to 30 s (OS flush interval)	Caches, analytics scratch
Periodic sync	Background fsync every N seconds	Up to N seconds	Redis `appendfsync everysec`
Per-transaction WAL	fsync WAL on commit	No committed transactions (in-flight ones roll back)	PostgreSQL default
Per-write sync	`O_SYNC` or fsync every write	No acknowledged writes, even mid-transaction	Financial ledgers
Replicated	Sync replication to N nodes	Zero, survives node failure	etcd, Google Spanner

Crash Recovery: Replaying the WAL

When a database restarts after a crash, it runs recovery:

1. Open the WAL
2. Find the last checkpoint (last known consistent state)
3. Replay all WAL records after the checkpoint
4. Reach a consistent state, then open for business

This is possible because the WAL is written before the main data file. Even if the checkpoint is stale and the data file is partially written, the WAL has the full intent of every committed transaction.

InnoDB calls this crash recovery. PostgreSQL calls it WAL replay. ZFS uses a similar concept called the intent log (ZIL).
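The recovery steps can be sketched as a single loop over the log. The record layout (`LSN, length, CRC32, payload`), the `key=value` payload, and the function names are assumptions made for illustration; the log is taken as bytes already read from disk.

```python
import struct
import zlib

HEADER = struct.Struct("<QII")  # LSN, payload length, CRC32 of payload

def encode(lsn: int, payload: bytes) -> bytes:
    """Build one log record in the assumed layout."""
    return HEADER.pack(lsn, len(payload), zlib.crc32(payload)) + payload

def replay(log: bytes, checkpoint_lsn: int, state: dict) -> int:
    """Apply every intact record newer than the checkpoint; return the last LSN applied."""
    pos, last = 0, checkpoint_lsn
    while pos + HEADER.size <= len(log):
        lsn, length, crc = HEADER.unpack_from(log, pos)
        payload = log[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: that write never completed, so stop here
        pos += HEADER.size + length
        if lsn > checkpoint_lsn:  # older records are already in the data file
            key, _, value = payload.decode().partition("=")
            state[key] = value
            last = lsn
    return last
```

Stopping at the first bad checksum is safe because a record that was never fully written was never fsynced, and so was never acknowledged to a client.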

Filesystems and Journaling

Journaling filesystems (ext4, XFS, NTFS) keep their own metadata crash-consistent with a journal: a WAL for filesystem metadata (directory entries, inode tables, extent maps). APFS reaches the same goal with copy-on-write metadata instead of a journal.

Without journaling, a crash mid-write could leave the directory tree inconsistent: a file appears in the directory but its inode is zeroed. Journaling ensures metadata changes are atomic.

ext4 defaults to ordered mode (data=ordered): file data is written to disk before the journal commits the metadata that refers to it. XFS and NTFS likewise journal metadata only. This protects against metadata corruption but not against data loss in the dirty page window.

Full data journaling (ext4 data=journal) journals both data and metadata. File contents are then written twice, once to the journal and once to their final location, so this mode is rarely used.

ZFS: Durability by Default

ZFS takes a different approach. It uses copy-on-write semantics: it never overwrites live data. Instead, every write goes to a new block, and the uberblock (ZFS's counterpart to a superblock) is atomically updated to point to the new tree.

This means:

No partial writes to live data — the old data is always consistent
The ZIL (ZFS Intent Log) provides synchronous write durability for applications that call fsync
ZFS can dedicate a fast NVMe device as a SLOG (separate intent log) to absorb synchronous writes cheaply
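The copy-on-write idea can be shown with a toy in-memory model. This is a conceptual sketch only and bears no resemblance to ZFS's on-disk format: `CowStore`, its dict-based "blocks", and the `crash_before_commit` switch are invented to make the ordering visible.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never modified, one root pointer names the live tree."""

    def __init__(self):
        self.blocks = {0: {}}  # block id -> snapshot of key/value data
        self.root = 0          # stands in for the atomically updated top-level pointer
        self.next_id = 1

    def write(self, key, value, crash_before_commit=False):
        new_tree = dict(self.blocks[self.root])  # copy; live data is left untouched
        new_tree[key] = value
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = new_tree         # new data lands in a fresh block
        if crash_before_commit:
            return                               # simulated crash: root still names the old tree
        self.root = block_id                     # one pointer swap commits the write

    def read(self, key):
        return self.blocks[self.root].get(key)
```

A write interrupted before the pointer swap leaves an orphaned block and a fully intact old tree, which is why readers never see a half-applied update.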

The Durability–Performance Trade-off

Every durability guarantee costs latency. Here's how the major knobs trade:

Knob	Safer →	Faster →
PostgreSQL `synchronous_commit`	`on`	`off` (risk: 3× `wal_writer_delay` data loss)
InnoDB `innodb_flush_log_at_trx_commit`	`1` (fsync/commit)	`2` (write at commit, fsync about once per second) or `0` (write and fsync about once per second)
Redis `appendfsync`	`always`	`everysec` or `no`
Replication	synchronous	asynchronous

Understanding the guarantee each setting provides — and the exact failure scenario where it falls short — is what separates "probably works" from "provably durable."

What Architects Need to Know

Durability is a contract, not a feature. Before choosing a storage system, define exactly what failure scenarios you must survive — process crash, OS crash, full power loss, or node failure — and verify the system's guarantees against each one. "We use Postgres" is not a durability strategy; synchronous_commit = off silently weakens it.

fsync lies are real. Some consumer SSDs and some virtualised block devices acknowledge writes before they hit stable media. If your workload is financial or audit-critical, run pg_test_fsync (or equivalent) on your actual hardware in your actual environment, and treat implausibly fast results as a warning sign. Cloud vendors document their durability SLAs; read them.

WAL is the load-bearing beam. Almost every durable system — databases, message queues, event stores — is a WAL with indexes bolted on. Understanding WAL mechanics (sequential write, checkpoint, replay) lets you reason about any of them, not just Postgres.

Durability and availability are separate axes. A single-node database with fsync = on is durable (survives crash) but not available (fails when the node goes down). Synchronous replication adds availability but puts at least one network round trip on every commit. Async replication keeps commit latency low but creates a data-loss window on failover. These are explicit trade-offs to choose deliberately, not defaults to inherit.

Where you place the fsync boundary determines your throughput ceiling. One fsync per user request at 10 ms each caps you at 100 RPS from a single writer thread. Group commit, pipelining, or moving fsync to a replicated log (like Kafka) are the standard escapes. Measure your actual fsync latency before optimising — SSD and spinning disk differ by two orders of magnitude.

Cloud storage abstracts but doesn't eliminate the problem. EBS, GCP Persistent Disk, and Azure Managed Disks provide durable block storage with their own replication. But durability at the block level doesn't protect against torn writes at the application level; you still need WAL or fsync for database consistency. Object stores (S3, GCS) offer 11-nines durability on completed writes. An object upload is all-or-nothing: there are no in-place partial updates, and an upload that never completes never becomes a visible object. It is not always cleaned up for you, though: on S3, the parts of an unfinished multipart upload linger, and are billed, until the upload is aborted or a lifecycle rule expires them.