6. Storage Systems — System Design Concepts

💾 6 · STORAGE SYSTEMS

Database Choice Guide

Choose based on access patterns, guarantees, and scale

Need	Choose	Why	Examples
ACID + complex joins	SQL	Strong consistency, referential integrity	Postgres, MySQL, Aurora
High write throughput	Wide-Column	LSM-tree, masterless, tunable consistency	Cassandra, ScyllaDB
Flexible schema	Document	JSON-like, single-doc ACID	MongoDB, Firestore
Sub-ms key lookup	Key-Value	O(1) get/put, in-memory option	Redis, DynamoDB
Relationship traversal	Graph	Fast traversal without joins	Neo4j, Neptune
Full-text search	Search Engine	Inverted index, BM25 ranking	Elasticsearch, OpenSearch
Global ACID at scale	NewSQL	SQL + distributed consensus	Spanner, CockroachDB
Analytics (OLAP)	Columnar	Column-oriented, fast aggregations	BigQuery, ClickHouse, Apache Pinot

Database Internals

Data structures that power your databases — how they store, index, and retrieve data

▸ Core Storage Engines

B+Tree (Read-Optimized)

LSM Tree (Write-Optimized)

▸ 8 Data Structures That Power Databases

1. Skiplist

In-memory · O(log N) search/insert · Multiple levels act as "express lanes" — skip nodes for fast traversal. Redis sorted sets (ZADD/ZRANGE).

2. Hash Index

In-memory · O(1) average · Bucket array + chaining for collisions. Most common in-memory index solution.

3. SSTable

Disk-based · Sorted String Table — immutable sorted key-value file. Index maps keys→offsets. Building block of LSM trees. Seldom used alone.

4. WAL

Durability · Write-Ahead Log — every change logged before applying. On crash, replay to recover. Postgres, MySQL, etcd, Kafka.

5. LSM Tree

Memory+Disk · Writes to memtable, flushed to SSTables. Compaction merges. High write throughput. Cassandra, RocksDB. Disk compaction may impact reads.

6. B-Tree / B+Tree

Disk-based · Most popular DB index. Balanced tree, sorted keys, data in leaves. O(log N). Leaf nodes linked for range scans. Postgres, MySQL InnoDB.

7. Inverted Index

Search · Maps each term → list of documents. BM25 ranking. Powers full-text search in Elasticsearch / Lucene.

8. R-Tree (Spatial)

Multi-dimensional · Hierarchical bounding boxes for spatial data. Nearest neighbor, geo range queries. PostGIS, MongoDB 2dsphere.

Schema: Normalization (3NF, eliminate redundancy, needs JOINs) vs Denormalization (duplicate for read speed, no JOINs). ACID vs BASE: ACID = strong consistency (SQL). BASE = Basically Available, Soft state, Eventually consistent (NoSQL).

DB Locks: Row-level (InnoDB default) · Table-level (MyISAM) · Intent locks (signal intent) · Advisory locks (app-level). MVCC — readers see snapshot, no read locks (Postgres, MySQL InnoDB).

Database Indexing

How indexes speed up reads by minimizing disk I/O — the difference between seconds and milliseconds

▸ Index Types

Index Type	How	Use Case
Clustered	Data physically sorted by index key (1 per table)	Primary key lookups, range scans
Non-clustered	Separate structure with pointers to data rows	Secondary lookups (email, name)
Composite	Multi-column index (leftmost prefix rule)	`WHERE a=1 AND b=2`
Covering	All query columns in the index — no table access	Index-only scans (fastest reads)
Partial	Index only a subset of rows (`WHERE active=true`)	Sparse data, smaller index size
GIN	Generalized Inverted — multi-value keys	Full-text search, JSONB, arrays (Postgres)
GiST	Generalized Search Tree — spatial/range	Geo queries (PostGIS), range types
BRIN	Block Range — summary per block range	Large sequential data (time-series, logs)

▸ How Indexing Makes Your DB Faster

The problem: Your data lives on disk, not RAM. Disk reads are 10,000× slower than memory. Worse — disk reads happen at the block level (e.g., 4KB or 8KB chunks). Even reading 1 byte loads the entire block. So the key to fast queries is: read as few disk blocks as possible. That's exactly what an index does — it's a small, sorted lookup table that tells the DB which blocks to read, so it skips the rest.

① The Data: Users Table on Disk

Each row = 200 bytes. Disk block = 600 bytes. So 3 rows fit per block.

② The Index: A Tiny Sorted Lookup Table

Only stores the column you search + pointer to the row. Much smaller than data.

▸ Query: SELECT * FROM users WHERE age = 23

The DB needs to find all users aged 23. Without an index, it has no idea which blocks contain age=23 — so it reads every single block. With an index, it first checks the tiny index to learn exactly which row IDs match, then fetches only those specific blocks.

❌ Without Index — Full Table Scan

DB doesn't know where age=23 lives. Must load and check every block, one by one.

✓ With Index — Targeted Lookup

Index says "age=23 is at id=1 (block 1) and id=4 (block 2)." DB reads only those 2 blocks.

④ 100 Rows

No Idx

34 blocks

Indexed

8.5× speedup

At Scale: 1M Rows

No Idx

333K blocks

B+Tree

~100,000× speedup

⑤ Key Insights

▸ Block-level reads: Disk loads entire block even for 1 byte — so fewer blocks = faster
▸ Index is tiny: Only stores search column + pointer (8B vs 200B/row) → 17× smaller
▸ Index is sorted: Can binary search or stop early — no need to scan all entries
▸ B+Tree: Real DBs use tree indexes → O(log N) block reads, not O(N)
▸ Trade-off: Indexes speed up reads but slow down writes (must update index on every INSERT/UPDATE)

✓ When to Index

▸ Columns in WHERE clauses (filter conditions)
▸ Columns in JOIN ON (foreign keys)
▸ Columns in ORDER BY / GROUP BY
▸ High-cardinality columns (many unique values)
▸ Read-heavy tables (more reads than writes)

⚠ When NOT to Index

▸ Small tables (full scan is fast enough)
▸ Write-heavy tables (index update overhead)
▸ Low-cardinality columns (boolean, gender — few unique values)
▸ Columns rarely used in queries
▸ Too many indexes → slows INSERT/UPDATE/DELETE

Bottom line: An index is like a book's table of contents — instead of reading every page to find "Chapter 7", you look up the page number in the TOC and jump directly there. At scale, the difference between indexed and unindexed is seconds vs milliseconds.

SQL (PostgreSQL, MySQL)

ACID transactions + complex queries. The default choice when you need consistency

Guarantees: Atomicity — all or nothing. Consistency — constraints(PK, FK, UNIQUE) always enforced. Isolation — locks / MVCC prevent dirty reads. Durability — WAL survives crashes. These guarantees mean partial updates are impossible and committed data is never lost.

Postgres Feature	Detail
MVCC	Multi-version — readers see snapshot, writers create new version. No read locks.
Indexes	B-Tree (default), GIN (full-text/JSONB), GiST (geo), BRIN (large sequential)
JSONB	Binary JSON with indexing — bridge SQL and document model
Partitioning	Range/list/hash. Partition pruning speeds queries on large tables.
Extensions	PostGIS (geo), TimescaleDB (time-series), Citus (distributed), pg_trgm (fuzzy)

Scaling: Read replicas (followers serve reads) · PgBouncer (connection pooling) · Citus/Vitess (sharding) · Vertical (bigger machine)

Limitations: Write bottleneck — single leader (~1-3K writes/sec). Resharding is painful. Rigid schema — ALTER on large tables can lock. Vertical scaling ceiling.

Real-world: Instagram — sharded Postgres. Stripe — Postgres for payments. Supabase — "Firebase on Postgres".

NoSQL

Horizontal scaling, flexible schema, tunable consistency

Type	Examples	Guarantee	Use Case
Key-Value	Redis, DynamoDB	O(1) lookup, partition tolerant	Sessions, cache
Document	MongoDB, Firestore	Single-doc ACID, flexible schema	Profiles, catalogs
Wide-Column	Cassandra, HBase	Write-optimized, tunable consistency	Time-series, IoT
Graph	Neo4j, Neptune	O(1) edge traversal	Social, fraud, recommendations

Cassandra Guarantees: Masterless ring — no SPOF. Tunable consistency — QUORUM (W+R>N = strong) or ONE (fast, eventual). Anti-entropy: Read repair, Merkle trees, hinted handoff keep replicas in sync.

Limitations: No complex joins. No multi-row ACID. Eventual consistency by default. Must model around queries. Tombstone overhead.

Real-world: Discord — Cassandra (migrated to ScyllaDB). Netflix — Cassandra for viewing history. Uber — Cassandra for location data.

NewSQL

Global ACID at scale. Spanner (TrueTime + Paxos), CockroachDB (Raft), TiDB

Guarantees: Serializability globally — strongest isolation. SQL interface — familiar tools work. Trade-off: consensus latency (100ms+ multi-region) and 10x cost vs Postgres. Most apps don't need this.

Time-Series DBs

InfluxDB, Prometheus, TimescaleDB — optimized for append-heavy + range queries

Guarantees: High ingest (millions/sec). Auto-retention (TTL cleanup). Downsampling (1s→1min→1hr). Prometheus = pull-based scraping. TimescaleDB = Postgres extension (SQL for time-series).

Search (Elasticsearch)

Inverted index + BM25 ranking + fuzzy matching. NOT a primary data store — always reindex from source DB

Apache Lucene is the core search library that does the actual work — it builds inverted indexes, runs queries, scores results with BM25, and manages segments on disk. Elasticsearch wraps Lucene with a distributed layer: sharding, replication, REST API, cluster management. Solr is another wrapper. When you search in ES, every shard is a self-contained Lucene index.

▸ Why Not Just Query the DB?

❌ Without Inverted Index

✓ With Inverted Index

▸ How an Inverted Index Works (Lucene's Core)

▸ What's Inside a Posting List?

Data	What	Enables
Doc ID	Which documents contain this term	Basic search — find matching docs
Term Frequency	How many times term appears in doc	Relevance ranking — more mentions = more relevant
Position	Where in the doc the term appears (word #)	Phrase queries — "fish climbing" (adjacent words)
Offset	Character start/end position in original text	Highlighting — bold matched words in snippets

▸ Text Processing Pipeline (Lucene Analyzer)

▸ Ranking: How Results Are Scored (BM25)

Not all matches are equal. BM25 scores each document by: Term Frequency (more mentions = more relevant, with diminishing returns) × Inverse Document Frequency (rare terms matter more — "fish" in 2 of 1M docs scores higher than "the" in 999K docs) × Document Length (shorter docs with the term rank higher — a 10-word doc mentioning "fish" 3× is more focused than a 10K-word doc mentioning it 3×).

▸ Inside Lucene: Segments & Merging

Lucene doesn't update the inverted index in-place. Instead: new docs are written to an in-memory buffer → periodically flushed to an immutable segment on disk (each segment is a mini inverted index). Searches fan out across all segments and merge results. Background merge policy combines small segments into larger ones (like LSM compaction). Deletes are just tombstone markers — physically removed during merge. This is why ES has a refresh interval (default 1s) — new docs aren't searchable until the buffer is flushed to a segment.

▸ Optimizations at Scale

Technique	How	Why
Sorted Posting Lists	Doc IDs stored in sorted order	Efficient merge/intersection — O(n) for AND queries
Delta Encoding	Store gaps between IDs, not absolute IDs	Compress posting lists — critical at Google scale
Tiered / Champion Lists	Keep top-N docs per term in memory	Fast access for frequent queries — rest on disk
Positional Index	Store exact word positions in each doc	Phrase queries ("fish climbing") and proximity search
N-gram Indexing	Index bigrams/trigrams ("new york")	Multi-word search, autocomplete, fuzzy matching
Sharding	Split index across nodes (each shard = Lucene index)	Horizontal scale — scatter-gather across shards

▸ Architecture: ES wraps Lucene

Guarantees: Sub-second full-text search across billions of docs. Shard resilience — replicas survive node failure. Not ACID — eventually consistent (refresh interval delay). Not a source of truth — must reindex from primary DB.

Real-world: Wikipedia (full-text search), GitHub (code search across 200M+ repos), Uber (trip search), Netflix (log analysis via ELK). Use search_after for deep pagination (not from/size).

Blob Storage (S3)

99.999999999% durability (11 nines). Cheap, infinite scale. The default for any unstructured data — images, videos, backups, logs.

Feature	How It Works	Use Case
Pre-signed URLs	Server generates time-limited signed URL → client uploads/downloads directly to S3 (no proxy)	Large file uploads without loading app server
Versioning	Every overwrite creates a new version. Old versions retained until explicitly deleted.	Accidental delete protection, audit trail
Multipart Upload	Split large files into chunks (5MB-5GB each), upload in parallel, assemble on S3	Files >100MB, resumable uploads over flaky networks
Event Notifications	S3 triggers Lambda/SQS/SNS on PUT/DELETE events	Auto-generate thumbnails, trigger transcoding pipeline
Cross-Region Replication	Async replicate objects to another region bucket	Disaster recovery, compliance (data residency)

Guarantees: Strong read-after-write consistency (S3, since Dec 2020). 11 nines durability = lose 1 object per 100 billion stored per year. Infinite scale — no capacity planning needed. Partition prefixes for >5,500 GET/sec or >3,500 PUT/sec per prefix.

Design patterns: Media uploads — pre-signed URL → S3 → CDN. Data lake — Parquet/ORC files on S3 → query with Athena/Spark. Backups — DB snapshots → S3 → lifecycle to Glacier. Static hosting — HTML/CSS/JS on S3 + CloudFront.

Connection Pooling

Reuse expensive DB connections instead of opening per request

Pooler	Stack	Mode
PgBouncer	Postgres	session / transaction / statement
HikariCP	Java / JVM	in-process
ProxySQL	MySQL	proxy + query routing
RDS Proxy	AWS	managed, IAM-aware

Serverless gotcha: 1000 Lambda containers × 5 conns each = 5000 — your DB caps at 100. Always front Lambda with a pooler.

Schema Migrations

Version-controlled, repeatable DB changes

Tool	Stack
Flyway	JVM, polyglot SQL
Liquibase	JVM, XML/YAML changesets
Alembic	Python / SQLAlchemy
Prisma Migrate	Node / TS
Rails / Django Migrations	built-in ORM

-- V20260510__add_user_email.sql
ALTER TABLE users
  ADD COLUMN email TEXT;
CREATE UNIQUE INDEX users_email_uq
  ON users (email);

Big-table pitfall: ALTER can take an exclusive lock for hours. Use pt-online-schema-change (MySQL) or gh-ost for zero-downtime changes (shadow table + chunked copy + atomic swap).

Expand–contract: add new column → dual-write → backfill → switch reads → drop old. Never break the running app.

Vector Databases

Store + ANN-search high-dimensional embeddings for AI/RAG

DB	Notes
Pinecone / Weaviate / Milvus	Managed or OSS, native ANN
pgvector	Postgres extension — reuse existing DB
Qdrant / Chroma	Lightweight, dev-friendly
OpenSearch / Elastic	Hybrid lexical + vector

Use cases: semantic search, RAG, dedup, recommendations, image / audio similarity.

▸ ANN Index Algorithms Compared

Algorithm	How It Works	Speed	Accuracy	Memory	Best For
HNSW	Hierarchical graph — navigate layers from coarse to fine	Very fast	High (95%+)	High (in-memory graph)	Low-latency serving, <1M vectors
IVF-PQ	Cluster vectors (IVF) + compress with Product Quantization	Fast	Good (90%+)	Low (compressed)	Billions of vectors, cost-sensitive
Flat (brute force)	Compare query against every vector	Slow (O(n))	Perfect (100%)	Full vectors in RAM	Small datasets (<100K), ground truth
ScaNN	Google's anisotropic quantization + tree partitioning	Very fast	High	Medium	Google-scale, TensorFlow ecosystem

Embedding dimensions: OpenAI text-embedding-3-small = 1536 dims. Cohere = 1024. BGE/E5 = 768. Image (CLIP) = 512. Each vector = dims × 4 bytes (float32). 1M vectors × 1536 dims = ~6GB RAM. Use PQ compression for 10-50× reduction.

RAG pattern: User query → embed → ANN search top-K docs → feed docs + query to LLM → grounded answer. Vector DB is the retrieval layer that makes LLMs factual.

Graph DB Deep Dive

Index-free adjacency — O(1) hops between connected nodes

// Cypher: friends-of-friends who liked Post #42
MATCH (me:User {name:'Alice'})-[:FOLLOWS*1..2]->(f:User)
      -[:LIKES]->(p:Post {id:42})
RETURN DISTINCT f.name;

Use for	Avoid for
Social graphs, fraud cycles, recs	Tabular OLAP / aggregates
Knowledge graphs, dependency trees	High write throughput, BLOBs
Pathfinding, shortest-path	Simple CRUD apps (overkill)

Engines: Neo4j, Amazon Neptune, JanusGraph, ArangoDB (multi-model), TigerGraph (analytics-heavy).

▸ Graph DB Internals — Why O(1) Traversal?

Graph vs Relational for relationships: SQL JOIN on 5-hop social query = 5 table scans, exponential cost. Graph DB = 5 pointer follows, constant cost per hop. At 1M nodes with avg 50 edges: SQL "friends of friends of friends" = seconds. Graph = milliseconds.

When NOT to use Graph DB: Simple CRUD with no relationships. High-volume writes (>100K/sec). Aggregations/analytics over all data (use columnar DB). Storing blobs/documents. If your queries don't traverse relationships, a graph DB adds complexity for no benefit.