Choose based on access patterns, guarantees, and scale
Need
Choose
Why
Examples
ACID + complex joins
SQL
Strong consistency, referential integrity
Postgres, MySQL, Aurora
High write throughput
Wide-Column
LSM-tree, masterless, tunable consistency
Cassandra, ScyllaDB
Flexible schema
Document
JSON-like, single-doc ACID
MongoDB, Firestore
Sub-ms key lookup
Key-Value
O(1) get/put, in-memory option
Redis, DynamoDB
Relationship traversal
Graph
Fast traversal without joins
Neo4j, Neptune
Full-text search
Search Engine
Inverted index, BM25 ranking
Elasticsearch, OpenSearch
Global ACID at scale
NewSQL
SQL + distributed consensus
Spanner, CockroachDB
Analytics (OLAP)
Columnar
Column-oriented, fast aggregations
BigQuery, ClickHouse, Apache Pinot
Database Internals
Data structures that power your databases — how they store, index, and retrieve data
▸ Core Storage Engines
B+Tree (Read-Optimized)
LSM Tree (Write-Optimized)
▸ 8 Data Structures That Power Databases
1. Skiplist
In-memory · O(log N) search/insert · Multiple levels act as "express lanes" — skip nodes for fast traversal. Redis sorted sets (ZADD/ZRANGE).
2. Hash Index
In-memory · O(1) average · Bucket array + chaining for collisions. Most common in-memory index solution.
3. SSTable
Disk-based · Sorted String Table — immutable sorted key-value file. Index maps keys→offsets. Building block of LSM trees. Seldom used alone.
4. WAL
Durability · Write-Ahead Log — every change logged before applying. On crash, replay to recover. Postgres, MySQL, etcd, Kafka.
5. LSM Tree
Memory+Disk · Writes to memtable, flushed to SSTables. Compaction merges. High write throughput. Cassandra, RocksDB. Disk compaction may impact reads.
6. B-Tree / B+Tree
Disk-based · Most popular DB index. Balanced tree, sorted keys, data in leaves. O(log N). Leaf nodes linked for range scans. Postgres, MySQL InnoDB.
7. Inverted Index
Search · Maps each term → list of documents. BM25 ranking. Powers full-text search in Elasticsearch / Lucene.
8. R-Tree (Spatial)
Multi-dimensional · Hierarchical bounding boxes for spatial data. Nearest neighbor, geo range queries. PostGIS, MongoDB 2dsphere.
Schema:Normalization (3NF, eliminate redundancy, needs JOINs) vs Denormalization (duplicate for read speed, no JOINs). ACID vs BASE: ACID = strong consistency (SQL). BASE = Basically Available, Soft state, Eventually consistent (NoSQL).
DB Locks:Row-level (InnoDB default) · Table-level (MyISAM) · Intent locks (signal intent) · Advisory locks (app-level). MVCC — readers see snapshot, no read locks (Postgres, MySQL InnoDB).
Database Indexing
How indexes speed up reads by minimizing disk I/O — the difference between seconds and milliseconds
▸ Index Types
Index Type
How
Use Case
Clustered
Data physically sorted by index key (1 per table)
Primary key lookups, range scans
Non-clustered
Separate structure with pointers to data rows
Secondary lookups (email, name)
Composite
Multi-column index (leftmost prefix rule)
WHERE a=1 AND b=2
Covering
All query columns in the index — no table access
Index-only scans (fastest reads)
Partial
Index only a subset of rows (WHERE active=true)
Sparse data, smaller index size
GIN
Generalized Inverted — multi-value keys
Full-text search, JSONB, arrays (Postgres)
GiST
Generalized Search Tree — spatial/range
Geo queries (PostGIS), range types
BRIN
Block Range — summary per block range
Large sequential data (time-series, logs)
▸ How Indexing Makes Your DB Faster
The problem: Your data lives on disk, not RAM. Disk reads are 10,000× slower than memory. Worse — disk reads happen at the block level (e.g., 4KB or 8KB chunks). Even reading 1 byte loads the entire block. So the key to fast queries is: read as few disk blocks as possible. That's exactly what an index does — it's a small, sorted lookup table that tells the DB which blocks to read, so it skips the rest.
① The Data: Users Table on Disk
Each row = 200 bytes. Disk block = 600 bytes. So 3 rows fit per block.
② The Index: A Tiny Sorted Lookup Table
Only stores the column you search + pointer to the row. Much smaller than data.
▸ Query: SELECT * FROM users WHERE age = 23
The DB needs to find all users aged 23. Without an index, it has no idea which blocks contain age=23 — so it reads every single block. With an index, it first checks the tiny index to learn exactly which row IDs match, then fetches only those specific blocks.
❌ Without Index — Full Table Scan
DB doesn't know where age=23 lives. Must load and check every block, one by one.
✓ With Index — Targeted Lookup
Index says "age=23 is at id=1 (block 1) and id=4 (block 2)." DB reads only those 2 blocks.
④ 100 Rows
No Idx
34 blocks
Indexed
4
8.5× speedup
At Scale: 1M Rows
No Idx
333K blocks
B+Tree
~4
~100,000× speedup
⑤ Key Insights
▸Block-level reads: Disk loads entire block even for 1 byte — so fewer blocks = faster ▸Index is tiny: Only stores search column + pointer (8B vs 200B/row) → 17× smaller ▸Index is sorted: Can binary search or stop early — no need to scan all entries ▸B+Tree: Real DBs use tree indexes → O(log N) block reads, not O(N) ▸Trade-off: Indexes speed up reads but slow down writes (must update index on every INSERT/UPDATE)
✓ When to Index
▸ Columns in WHERE clauses (filter conditions) ▸ Columns in JOIN ON (foreign keys) ▸ Columns in ORDER BY / GROUP BY ▸ High-cardinality columns (many unique values) ▸ Read-heavy tables (more reads than writes)
⚠ When NOT to Index
▸ Small tables (full scan is fast enough) ▸ Write-heavy tables (index update overhead) ▸ Low-cardinality columns (boolean, gender — few unique values) ▸ Columns rarely used in queries ▸ Too many indexes → slows INSERT/UPDATE/DELETE
Bottom line: An index is like a book's table of contents — instead of reading every page to find "Chapter 7", you look up the page number in the TOC and jump directly there. At scale, the difference between indexed and unindexed is seconds vs milliseconds.
SQL (PostgreSQL, MySQL)
ACID transactions + complex queries. The default choice when you need consistency
Guarantees:Atomicity — all or nothing. Consistency — constraints(PK, FK, UNIQUE) always enforced. Isolation — locks / MVCC prevent dirty reads. Durability — WAL survives crashes. These guarantees mean partial updates are impossible and committed data is never lost.
Postgres Feature
Detail
MVCC
Multi-version — readers see snapshot, writers create new version. No read locks.
Limitations:Write bottleneck — single leader (~1-3K writes/sec). Resharding is painful. Rigid schema — ALTER on large tables can lock. Vertical scaling ceiling.
Real-world:Instagram — sharded Postgres. Stripe — Postgres for payments. Supabase — "Firebase on Postgres".
Cassandra Guarantees:Masterless ring — no SPOF. Tunable consistency — QUORUM (W+R>N = strong) or ONE (fast, eventual). Anti-entropy: Read repair, Merkle trees, hinted handoff keep replicas in sync.
Limitations:No complex joins. No multi-row ACID. Eventual consistency by default. Must model around queries. Tombstone overhead.
Real-world:Discord — Cassandra (migrated to ScyllaDB). Netflix — Cassandra for viewing history. Uber — Cassandra for location data.
NewSQL
Global ACID at scale. Spanner (TrueTime + Paxos), CockroachDB (Raft), TiDB
Guarantees:Serializability globally — strongest isolation. SQL interface — familiar tools work. Trade-off: consensus latency (100ms+ multi-region) and 10x cost vs Postgres. Most apps don't need this.
Time-Series DBs
InfluxDB, Prometheus, TimescaleDB — optimized for append-heavy + range queries
Inverted index + BM25 ranking + fuzzy matching. NOT a primary data store — always reindex from source DB
Apache Lucene is the core search library that does the actual work — it builds inverted indexes, runs queries, scores results with BM25, and manages segments on disk. Elasticsearch wraps Lucene with a distributed layer: sharding, replication, REST API, cluster management. Solr is another wrapper. When you search in ES, every shard is a self-contained Lucene index.
▸ Why Not Just Query the DB?
❌ Without Inverted Index
✓ With Inverted Index
▸ How an Inverted Index Works (Lucene's Core)
▸ What's Inside a Posting List?
Data
What
Enables
Doc ID
Which documents contain this term
Basic search — find matching docs
Term Frequency
How many times term appears in doc
Relevance ranking — more mentions = more relevant
Position
Where in the doc the term appears (word #)
Phrase queries — "fish climbing" (adjacent words)
Offset
Character start/end position in original text
Highlighting — bold matched words in snippets
▸ Text Processing Pipeline (Lucene Analyzer)
▸ Ranking: How Results Are Scored (BM25)
Not all matches are equal. BM25 scores each document by: Term Frequency (more mentions = more relevant, with diminishing returns) × Inverse Document Frequency (rare terms matter more — "fish" in 2 of 1M docs scores higher than "the" in 999K docs) × Document Length (shorter docs with the term rank higher — a 10-word doc mentioning "fish" 3× is more focused than a 10K-word doc mentioning it 3×).
▸ Inside Lucene: Segments & Merging
Lucene doesn't update the inverted index in-place. Instead: new docs are written to an in-memory buffer → periodically flushed to an immutable segment on disk (each segment is a mini inverted index). Searches fan out across all segments and merge results. Background merge policy combines small segments into larger ones (like LSM compaction). Deletes are just tombstone markers — physically removed during merge. This is why ES has a refresh interval (default 1s) — new docs aren't searchable until the buffer is flushed to a segment.
▸ Optimizations at Scale
Technique
How
Why
Sorted Posting Lists
Doc IDs stored in sorted order
Efficient merge/intersection — O(n) for AND queries
Delta Encoding
Store gaps between IDs, not absolute IDs
Compress posting lists — critical at Google scale
Tiered / Champion Lists
Keep top-N docs per term in memory
Fast access for frequent queries — rest on disk
Positional Index
Store exact word positions in each doc
Phrase queries ("fish climbing") and proximity search
N-gram Indexing
Index bigrams/trigrams ("new york")
Multi-word search, autocomplete, fuzzy matching
Sharding
Split index across nodes (each shard = Lucene index)
Horizontal scale — scatter-gather across shards
▸ Architecture: ES wraps Lucene
Guarantees:Sub-second full-text search across billions of docs. Shard resilience — replicas survive node failure. Not ACID — eventually consistent (refresh interval delay). Not a source of truth — must reindex from primary DB.
Real-world:Wikipedia (full-text search), GitHub (code search across 200M+ repos), Uber (trip search), Netflix (log analysis via ELK). Use search_after for deep pagination (not from/size).
Blob Storage (S3)
99.999999999% durability (11 nines). Cheap, infinite scale. The default for any unstructured data — images, videos, backups, logs.
Feature
How It Works
Use Case
Pre-signed URLs
Server generates time-limited signed URL → client uploads/downloads directly to S3 (no proxy)
Large file uploads without loading app server
Versioning
Every overwrite creates a new version. Old versions retained until explicitly deleted.
Accidental delete protection, audit trail
Multipart Upload
Split large files into chunks (5MB-5GB each), upload in parallel, assemble on S3
Files >100MB, resumable uploads over flaky networks
Guarantees:Strong read-after-write consistency (S3, since Dec 2020). 11 nines durability = lose 1 object per 100 billion stored per year. Infinite scale — no capacity planning needed. Partition prefixes for >5,500 GET/sec or >3,500 PUT/sec per prefix.
Design patterns:Media uploads — pre-signed URL → S3 → CDN. Data lake — Parquet/ORC files on S3 → query with Athena/Spark. Backups — DB snapshots → S3 → lifecycle to Glacier. Static hosting — HTML/CSS/JS on S3 + CloudFront.
Connection Pooling
Reuse expensive DB connections instead of opening per request
Pooler
Stack
Mode
PgBouncer
Postgres
session / transaction / statement
HikariCP
Java / JVM
in-process
ProxySQL
MySQL
proxy + query routing
RDS Proxy
AWS
managed, IAM-aware
Serverless gotcha: 1000 Lambda containers × 5 conns each = 5000 — your DB caps at 100. Always front Lambda with a pooler.
Schema Migrations
Version-controlled, repeatable DB changes
Tool
Stack
Flyway
JVM, polyglot SQL
Liquibase
JVM, XML/YAML changesets
Alembic
Python / SQLAlchemy
Prisma Migrate
Node / TS
Rails / Django Migrations
built-in ORM
-- V20260510__add_user_email.sql
ALTER TABLE users
ADD COLUMN email TEXT;
CREATE UNIQUE INDEX users_email_uq
ON users (email);
Big-table pitfall:ALTER can take an exclusive lock for hours. Use pt-online-schema-change (MySQL) or gh-ost for zero-downtime changes (shadow table + chunked copy + atomic swap).
Expand–contract: add new column → dual-write → backfill → switch reads → drop old. Never break the running app.
Vector Databases
Store + ANN-search high-dimensional embeddings for AI/RAG
RAG pattern: User query → embed → ANN search top-K docs → feed docs + query to LLM → grounded answer. Vector DB is the retrieval layer that makes LLMs factual.
Graph DB Deep Dive
Index-free adjacency — O(1) hops between connected nodes
// Cypher: friends-of-friends who liked Post #42
MATCH (me:User {name:'Alice'})-[:FOLLOWS*1..2]->(f:User)
-[:LIKES]->(p:Post {id:42})
RETURN DISTINCT f.name;
Graph vs Relational for relationships: SQL JOIN on 5-hop social query = 5 table scans, exponential cost. Graph DB = 5 pointer follows, constant cost per hop. At 1M nodes with avg 50 edges: SQL "friends of friends of friends" = seconds. Graph = milliseconds.
When NOT to use Graph DB: Simple CRUD with no relationships. High-volume writes (>100K/sec). Aggregations/analytics over all data (use columnar DB). Storing blobs/documents. If your queries don't traverse relationships, a graph DB adds complexity for no benefit.