System Design Concepts

No fluff — visual, concise, interview-ready

💾 6 · STORAGE SYSTEMS

Database Choice Guide

Choose based on access patterns, guarantees, and scale

NeedChooseWhyExamples
ACID + complex joinsSQLStrong consistency, referential integrityPostgres, MySQL, Aurora
High write throughputWide-ColumnLSM-tree, masterless, tunable consistencyCassandra, ScyllaDB
Flexible schemaDocumentJSON-like, single-doc ACIDMongoDB, Firestore
Sub-ms key lookupKey-ValueO(1) get/put, in-memory optionRedis, DynamoDB
Relationship traversalGraphFast traversal without joinsNeo4j, Neptune
Full-text searchSearch EngineInverted index, BM25 rankingElasticsearch, OpenSearch
Global ACID at scaleNewSQLSQL + distributed consensusSpanner, CockroachDB
Analytics (OLAP)ColumnarColumn-oriented, fast aggregationsBigQuery, ClickHouse, Apache Pinot

Database Internals

Data structures that power your databases — how they store, index, and retrieve data

Core Storage Engines

B+Tree (Read-Optimized)

7 16 1 · 2 · 5 · 6 9 · 11 17 · 22 · 23 Sorted keys · data in leaves · O(log N) Postgres · MySQL InnoDB Best for: reads (point + range queries)

LSM Tree (Write-Optimized)

Memory Memtable (skiplist) flush Disk SSTable 1 (sorted, immutable) SSTable 2 SSTable 3 ↻ compaction merges SSTables Cassandra · RocksDB · LevelDB Best for: writes (sequential I/O)
8 Data Structures That Power Databases

1. Skiplist

In-memory · O(log N) search/insert · Multiple levels act as "express lanes" — skip nodes for fast traversal. Redis sorted sets (ZADD/ZRANGE).

2. Hash Index

0 1 2 as btc jobs twitter
In-memory · O(1) average · Bucket array + chaining for collisions. Most common in-memory index solution.

3. SSTable

Index file aaa offset:0 len:4 aab offset:4 len:7 zzz offset:3132 u b e r t w i t t e r m blob file (sorted data)
Disk-based · Sorted String Table — immutable sorted key-value file. Index maps keys→offsets. Building block of LSM trees. Seldom used alone.

4. WAL

Write WAL then DB Log first → apply later → replay on crash
Durability · Write-Ahead Log — every change logged before applying. On crash, replay to recover. Postgres, MySQL, etcd, Kafka.

5. LSM Tree

Memory Skiplist Disk SSTable 1 SSTable 2 SSTable 3
Memory+Disk · Writes to memtable, flushed to SSTables. Compaction merges. High write throughput. Cassandra, RocksDB. Disk compaction may impact reads.

6. B-Tree / B+Tree

7 16 1·2·5·6 9·11 17·22·23
Disk-based · Most popular DB index. Balanced tree, sorted keys, data in leaves. O(log N). Leaf nodes linked for range scans. Postgres, MySQL InnoDB.

7. Inverted Index

Index is today my "my name is" "What day?" "I bought"
Search · Maps each term → list of documents. BM25 ranking. Powers full-text search in Elasticsearch / Lucene.

8. R-Tree (Spatial)

Multi-dimensional · Hierarchical bounding boxes for spatial data. Nearest neighbor, geo range queries. PostGIS, MongoDB 2dsphere.
Schema: Normalization (3NF, eliminate redundancy, needs JOINs) vs Denormalization (duplicate for read speed, no JOINs). ACID vs BASE: ACID = strong consistency (SQL). BASE = Basically Available, Soft state, Eventually consistent (NoSQL).
DB Locks: Row-level (InnoDB default) · Table-level (MyISAM) · Intent locks (signal intent) · Advisory locks (app-level). MVCC — readers see snapshot, no read locks (Postgres, MySQL InnoDB).

Database Indexing

How indexes speed up reads by minimizing disk I/O — the difference between seconds and milliseconds

Index Types
Index TypeHowUse Case
ClusteredData physically sorted by index key (1 per table)Primary key lookups, range scans
Non-clusteredSeparate structure with pointers to data rowsSecondary lookups (email, name)
CompositeMulti-column index (leftmost prefix rule)WHERE a=1 AND b=2
CoveringAll query columns in the index — no table accessIndex-only scans (fastest reads)
PartialIndex only a subset of rows (WHERE active=true)Sparse data, smaller index size
GINGeneralized Inverted — multi-value keysFull-text search, JSONB, arrays (Postgres)
GiSTGeneralized Search Tree — spatial/rangeGeo queries (PostGIS), range types
BRINBlock Range — summary per block rangeLarge sequential data (time-series, logs)
How Indexing Makes Your DB Faster
The problem: Your data lives on disk, not RAM. Disk reads are 10,000× slower than memory. Worse — disk reads happen at the block level (e.g., 4KB or 8KB chunks). Even reading 1 byte loads the entire block. So the key to fast queries is: read as few disk blocks as possible. That's exactly what an index does — it's a small, sorted lookup table that tells the DB which blocks to read, so it skips the rest.

① The Data: Users Table on Disk

Each row = 200 bytes. Disk block = 600 bytes. So 3 rows fit per block.
id(4B) name(60B) age(4B) bio(128B) = 200B/row 1Alice23... 2Bob30... 3Carol25... 4Dave23... ... 100 rows total On disk (600B blocks): B1 B2 B3 ... B34 3 rows/block · ⌈100/3⌉ = 34 blocks

② The Index: A Tiny Sorted Lookup Table

Only stores the column you search + pointer to the row. Much smaller than data.
age(4B) id(4B) 231 234 253 302 ... 100 entries (sorted) 8B × 100 = 800B → 2 blocks IB1 IB2 vs 34 data blocks Index is 17× smaller than data!
Query: SELECT * FROM users WHERE age = 23
The DB needs to find all users aged 23. Without an index, it has no idea which blocks contain age=23 — so it reads every single block. With an index, it first checks the tiny index to learn exactly which row IDs match, then fetches only those specific blocks.

❌ Without Index — Full Table Scan

DB doesn't know where age=23 lives. Must load and check every block, one by one.
WHERE age=23 B1 B2 B3 B4 ... B33 B34 34 block reads Must read every block to find age=23 Slow — O(N) disk I/O

✓ With Index — Targeted Lookup

Index says "age=23 is at id=1 (block 1) and id=4 (block 2)." DB reads only those 2 blocks.
Step 1: scan index IB1 IB2 → found id=1, id=4 Step 2: fetch rows by id B1 B2 → got Alice, Dave 4 block reads (2 index + 2 data)

④ 100 Rows

No Idx
34 blocks
Indexed
4
8.5× speedup

At Scale: 1M Rows

No Idx
333K blocks
B+Tree
~4
~100,000× speedup

⑤ Key Insights

Block-level reads: Disk loads entire block even for 1 byte — so fewer blocks = faster
Index is tiny: Only stores search column + pointer (8B vs 200B/row) → 17× smaller
Index is sorted: Can binary search or stop early — no need to scan all entries
B+Tree: Real DBs use tree indexes → O(log N) block reads, not O(N)
Trade-off: Indexes speed up reads but slow down writes (must update index on every INSERT/UPDATE)

✓ When to Index

Columns in WHERE clauses (filter conditions)
Columns in JOIN ON (foreign keys)
Columns in ORDER BY / GROUP BY
High-cardinality columns (many unique values)
Read-heavy tables (more reads than writes)

⚠ When NOT to Index

Small tables (full scan is fast enough)
Write-heavy tables (index update overhead)
Low-cardinality columns (boolean, gender — few unique values)
Columns rarely used in queries
Too many indexes → slows INSERT/UPDATE/DELETE
Bottom line: An index is like a book's table of contents — instead of reading every page to find "Chapter 7", you look up the page number in the TOC and jump directly there. At scale, the difference between indexed and unindexed is seconds vs milliseconds.

SQL (PostgreSQL, MySQL)

ACID transactions + complex queries. The default choice when you need consistency

Guarantees: Atomicity — all or nothing. Consistency — constraints(PK, FK, UNIQUE) always enforced. Isolation — locks / MVCC prevent dirty reads. Durability — WAL survives crashes. These guarantees mean partial updates are impossible and committed data is never lost.
Postgres FeatureDetail
MVCCMulti-version — readers see snapshot, writers create new version. No read locks.
IndexesB-Tree (default), GIN (full-text/JSONB), GiST (geo), BRIN (large sequential)
JSONBBinary JSON with indexing — bridge SQL and document model
PartitioningRange/list/hash. Partition pruning speeds queries on large tables.
ExtensionsPostGIS (geo), TimescaleDB (time-series), Citus (distributed), pg_trgm (fuzzy)
Scaling: Read replicas (followers serve reads) · PgBouncer (connection pooling) · Citus/Vitess (sharding) · Vertical (bigger machine)
Limitations: Write bottleneck — single leader (~1-3K writes/sec). Resharding is painful. Rigid schema — ALTER on large tables can lock. Vertical scaling ceiling.
Real-world: Instagram — sharded Postgres. Stripe — Postgres for payments. Supabase — "Firebase on Postgres".

NoSQL

Horizontal scaling, flexible schema, tunable consistency

TypeExamplesGuaranteeUse Case
Key-ValueRedis, DynamoDBO(1) lookup, partition tolerantSessions, cache
DocumentMongoDB, FirestoreSingle-doc ACID, flexible schemaProfiles, catalogs
Wide-ColumnCassandra, HBaseWrite-optimized, tunable consistencyTime-series, IoT
GraphNeo4j, NeptuneO(1) edge traversalSocial, fraud, recommendations
Cassandra Guarantees: Masterless ring — no SPOF. Tunable consistency — QUORUM (W+R>N = strong) or ONE (fast, eventual). Anti-entropy: Read repair, Merkle trees, hinted handoff keep replicas in sync.
Limitations: No complex joins. No multi-row ACID. Eventual consistency by default. Must model around queries. Tombstone overhead.
Real-world: Discord — Cassandra (migrated to ScyllaDB). Netflix — Cassandra for viewing history. Uber — Cassandra for location data.

NewSQL

Global ACID at scale. Spanner (TrueTime + Paxos), CockroachDB (Raft), TiDB

Guarantees: Serializability globally — strongest isolation. SQL interface — familiar tools work. Trade-off: consensus latency (100ms+ multi-region) and 10x cost vs Postgres. Most apps don't need this.

Time-Series DBs

InfluxDB, Prometheus, TimescaleDB — optimized for append-heavy + range queries

Guarantees: High ingest (millions/sec). Auto-retention (TTL cleanup). Downsampling (1s→1min→1hr). Prometheus = pull-based scraping. TimescaleDB = Postgres extension (SQL for time-series).

Blob Storage (S3)

99.999999999% durability (11 nines). Cheap, infinite scale. The default for any unstructured data — images, videos, backups, logs.

Client upload image App Server generate pre-signed URL S3 / Blob Store PUT via pre-signed URL 11 nines durability CDN Edge serve to users Storage Tiers — Automatic Lifecycle Policies Standard $0.023/GB/mo frequent access 30d Infrequent (IA) $0.0125/GB/mo ~1×/month access 90d Glacier $0.004/GB/mo retrieval: minutes 1yr Deep Archive $0.00099/GB/mo retrieval: 12-48 hrs
FeatureHow It WorksUse Case
Pre-signed URLsServer generates time-limited signed URL → client uploads/downloads directly to S3 (no proxy)Large file uploads without loading app server
VersioningEvery overwrite creates a new version. Old versions retained until explicitly deleted.Accidental delete protection, audit trail
Multipart UploadSplit large files into chunks (5MB-5GB each), upload in parallel, assemble on S3Files >100MB, resumable uploads over flaky networks
Event NotificationsS3 triggers Lambda/SQS/SNS on PUT/DELETE eventsAuto-generate thumbnails, trigger transcoding pipeline
Cross-Region ReplicationAsync replicate objects to another region bucketDisaster recovery, compliance (data residency)
Guarantees: Strong read-after-write consistency (S3, since Dec 2020). 11 nines durability = lose 1 object per 100 billion stored per year. Infinite scale — no capacity planning needed. Partition prefixes for >5,500 GET/sec or >3,500 PUT/sec per prefix.
Design patterns: Media uploads — pre-signed URL → S3 → CDN. Data lake — Parquet/ORC files on S3 → query with Athena/Spark. Backups — DB snapshots → S3 → lifecycle to Glacier. Static hosting — HTML/CSS/JS on S3 + CloudFront.

Connection Pooling

Reuse expensive DB connections instead of opening per request

App pod 1 App pod 2 App pod 3 Pooler PgBouncer / HikariCP N reusable conns Postgres 100 max conns
PoolerStackMode
PgBouncerPostgressession / transaction / statement
HikariCPJava / JVMin-process
ProxySQLMySQLproxy + query routing
RDS ProxyAWSmanaged, IAM-aware
Serverless gotcha: 1000 Lambda containers × 5 conns each = 5000 — your DB caps at 100. Always front Lambda with a pooler.

Schema Migrations

Version-controlled, repeatable DB changes

ToolStack
FlywayJVM, polyglot SQL
LiquibaseJVM, XML/YAML changesets
AlembicPython / SQLAlchemy
Prisma MigrateNode / TS
Rails / Django Migrationsbuilt-in ORM
-- V20260510__add_user_email.sql
ALTER TABLE users
  ADD COLUMN email TEXT;
CREATE UNIQUE INDEX users_email_uq
  ON users (email);
Big-table pitfall: ALTER can take an exclusive lock for hours. Use pt-online-schema-change (MySQL) or gh-ost for zero-downtime changes (shadow table + chunked copy + atomic swap).
Expand–contract: add new column → dual-write → backfill → switch reads → drop old. Never break the running app.

Vector Databases

Store + ANN-search high-dimensional embeddings for AI/RAG

Doc / query Embed model OpenAI / BGE Vector [768] cosine sim Vector DB HNSW / IVF Top-K nearest neighbors return semantically similar items
DBNotes
Pinecone / Weaviate / MilvusManaged or OSS, native ANN
pgvectorPostgres extension — reuse existing DB
Qdrant / ChromaLightweight, dev-friendly
OpenSearch / ElasticHybrid lexical + vector
Use cases: semantic search, RAG, dedup, recommendations, image / audio similarity.
ANN Index Algorithms Compared
AlgorithmHow It WorksSpeedAccuracyMemoryBest For
HNSWHierarchical graph — navigate layers from coarse to fineVery fastHigh (95%+)High (in-memory graph)Low-latency serving, <1M vectors
IVF-PQCluster vectors (IVF) + compress with Product QuantizationFastGood (90%+)Low (compressed)Billions of vectors, cost-sensitive
Flat (brute force)Compare query against every vectorSlow (O(n))Perfect (100%)Full vectors in RAMSmall datasets (<100K), ground truth
ScaNNGoogle's anisotropic quantization + tree partitioningVery fastHighMediumGoogle-scale, TensorFlow ecosystem
Embedding dimensions: OpenAI text-embedding-3-small = 1536 dims. Cohere = 1024. BGE/E5 = 768. Image (CLIP) = 512. Each vector = dims × 4 bytes (float32). 1M vectors × 1536 dims = ~6GB RAM. Use PQ compression for 10-50× reduction.
RAG pattern: User query → embed → ANN search top-K docs → feed docs + query to LLM → grounded answer. Vector DB is the retrieval layer that makes LLMs factual.

Graph DB Deep Dive

Index-free adjacency — O(1) hops between connected nodes

Alice Bob Carol Post #42 FOLLOWS FOLLOWS LIKES LIKES
// Cypher: friends-of-friends who liked Post #42
MATCH (me:User {name:'Alice'})-[:FOLLOWS*1..2]->(f:User)
      -[:LIKES]->(p:Post {id:42})
RETURN DISTINCT f.name;
Use forAvoid for
Social graphs, fraud cycles, recsTabular OLAP / aggregates
Knowledge graphs, dependency treesHigh write throughput, BLOBs
Pathfinding, shortest-pathSimple CRUD apps (overkill)
Engines: Neo4j, Amazon Neptune, JanusGraph, ArangoDB (multi-model), TigerGraph (analytics-heavy).
Graph DB Internals — Why O(1) Traversal?
Node Record (fixed size) id: 42 labels: [:User] first_rel_ptr: → rel#7 first_prop_ptr: → prop#3 Direct pointer — no index lookup! Relationship Record id: 7 (FOLLOWS) start_node: → node#42 end_node: → node#99 next_rel_ptr: → rel#12 Linked list of relationships Index-Free Adjacency Each node stores direct pointers to its neighbors. Traversal = follow pointers O(1) per hop regardless of total graph size!
Graph vs Relational for relationships: SQL JOIN on 5-hop social query = 5 table scans, exponential cost. Graph DB = 5 pointer follows, constant cost per hop. At 1M nodes with avg 50 edges: SQL "friends of friends of friends" = seconds. Graph = milliseconds.
When NOT to use Graph DB: Simple CRUD with no relationships. High-volume writes (>100K/sec). Aggregations/analytics over all data (use columnar DB). Storing blobs/documents. If your queries don't traverse relationships, a graph DB adds complexity for no benefit.