Executive Summary
System design in 2026 reflects a mature cloud-native landscape. 90% of organizations use cloud infrastructure, 78% run containerized workloads, and 85% use CDNs. PostgreSQL and Redis dominate their categories. Apache Kafka is the standard for event streaming. The key challenge has shifted from building scalable systems to operating them efficiently: observability, cost optimization, and reliability engineering are the focus areas.
Key figures: 90% cloud adoption, 85% CDN usage, 78% containerized workloads, 42% serverless.
Part 1: Scaling Fundamentals
Vertical scaling adds resources (more CPU, RAM) to one machine. It is simple but bounded by hardware limits. Horizontal scaling adds more machines and requires stateless services, load balancing, and distributed data. Scaling up is a reasonable first step for a small system, but most web applications should be designed for horizontal scaling from the start, with stateless services behind a load balancer.
Figure: Infrastructure Adoption (2018-2026). Source: OnlineTools4Free Research.
Part 2: Load Balancing
Load balancers distribute traffic across servers. Layer 4 (TCP) routes by IP/port. Layer 7 (HTTP) routes by URL, headers, cookies. Algorithms: round-robin (simple), least connections (efficient), IP hash (sticky sessions). NGINX and HAProxy dominate software load balancing. Cloud: AWS ALB/NLB, GCP Load Balancer.
Load Balancer Comparison (2026)
| Load Balancer | Type | Algorithms | Best For |
|---|---|---|---|
| NGINX | Software/Reverse Proxy | Round-robin, least-conn, ip-hash, weighted | High-performance reverse proxy, static serving |
| HAProxy | Software | Round-robin, leastconn, source, uri | TCP/HTTP load balancing, health checking |
| AWS ALB | Managed (AWS) | Round-robin, least outstanding requests | AWS HTTP/HTTPS load balancing |
| AWS NLB | Managed (AWS) | Flow hash | Ultra-low latency TCP/UDP, static IP |
| Cloudflare LB | Managed (Global) | Round-robin, geo, weighted | Global load balancing with CDN |
| Envoy | Software/Service Mesh | Round-robin, least-request, ring-hash | Service mesh, gRPC, observability |
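The round-robin and least-connections algorithms listed above can be sketched in a few lines. This is a minimal illustration, not how NGINX or HAProxy implement them; the class names and the backend names (`app-1`, `app-2`, `app-3`) are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through backends in a fixed order."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Picks the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def acquire(self):
        # min() returns the first backend among ties (dicts keep insertion order).
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1


rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
rr_picks = [rr.pick() for _ in range(4)]  # wraps around after the third pick

lc = LeastConnectionsBalancer(["app-1", "app-2"])
first = lc.acquire()    # both idle, so the first backend is chosen
second = lc.acquire()   # the other backend now has fewer connections
lc.release(first)       # the first request finishes
third = lc.acquire()    # the freed backend is chosen again
```

Round-robin ignores how busy each backend is, which is why least connections tends to behave better when request durations vary widely.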
Part 3: Caching Strategies
Caching stores frequently accessed data in fast storage (memory). Levels: browser cache, CDN edge cache, application cache (Redis), database query cache. Cache-aside is the most common pattern: check cache, fetch from DB on miss, write to cache. Redis is the standard caching solution. Cache invalidation is widely regarded as one of the hardest problems in the field.
Caching Strategies Comparison
| Strategy | Description | Consistency | Best For |
|---|---|---|---|
| Cache-Aside (Lazy Loading) | App checks cache first. On miss, fetches from DB and writes to cache. Most common pattern. | Eventual (stale reads possible) | Read-heavy workloads, general purpose |
| Write-Through | App writes to the cache and the DB synchronously as part of the same operation, so the cache stays up to date. | Strong | Workloads that read data soon after writing it |
| Write-Behind (Write-Back) | App writes to cache only. Cache asynchronously writes to DB. Higher throughput. | Eventual (data loss risk) | Very high write throughput |
| Read-Through | Cache sits between app and DB. On miss, cache fetches from DB automatically. | Eventual | Simplifying cache logic in app code |
| Refresh-Ahead | Cache proactively refreshes entries before they expire based on access patterns. | Eventual (fresher) | Predictable access patterns, hot data |
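The cache-aside pattern from the table can be sketched as follows. This is a minimal in-process illustration under stated assumptions: a plain dict stands in for Redis, `load_from_db` is a hypothetical loader for the database, and the `CacheAside` class name is invented for the example.

```python
import time


class CacheAside:
    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                 # cache hit
        value = self._load(key)             # cache miss: read from the database
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        # Call after writing to the database so the next read reloads.
        self._entries.pop(key, None)


db = {"user:1": "Ada"}   # hypothetical stand-in for a database table
db_reads = []


def load_from_db(key):
    db_reads.append(key)
    return db[key]


cache = CacheAside(load_from_db, ttl_seconds=60.0)
a = cache.get("user:1")       # miss: loads from the database
b = cache.get("user:1")       # hit: served from the cache
db["user:1"] = "Grace"        # write that bypasses the cache
stale = cache.get("user:1")   # still the old value until TTL or invalidation
cache.invalidate("user:1")
fresh = cache.get("user:1")   # reloads the new value
```

The `stale` read shows why the table lists cache-aside as eventually consistent: without an explicit invalidation, readers see the old value until the TTL expires.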
Part 4: Database Selection
Database selection depends on data model, consistency requirements, query patterns, and scale. PostgreSQL is the default for most applications (ACID, JSON, full-text search). MongoDB for flexible schemas. Redis for caching and real-time data. Elasticsearch for search. Kafka for event streaming. DynamoDB for serverless. Most applications start with PostgreSQL and add specialized databases as needed.
Database Comparison (2026)
| Database | Type | Consistency | Scalability | Best For |
|---|---|---|---|---|
| PostgreSQL | Relational | ACID | Vertical + read replicas | General purpose, complex queries, JSONB |
| MySQL | Relational | ACID | Vertical + replicas + Vitess | Web applications, WordPress, read-heavy |
| MongoDB | Document | Tunable | Horizontal (sharding) | Flexible schema, rapid development, JSON data |
| Redis | Key-Value / Cache | Eventual (replicas) | Cluster mode (horizontal) | Caching, sessions, real-time leaderboards |
| Elasticsearch | Search Engine | Near real-time | Horizontal (shards) | Full-text search, log analytics, faceted search |
| DynamoDB | Key-Value / Document | Eventual or strong | Horizontal (managed) | AWS serverless, single-digit ms latency |
| Cassandra | Wide Column | Tunable (AP) | Horizontal (multi-DC) | Write-heavy, time-series, geo-distributed |
| ClickHouse | Columnar/OLAP | Eventual | Horizontal | Analytics, real-time aggregation, log analysis |
Part 5: Message Queues
Message queues enable async communication between services. Benefits: decoupling, buffering traffic spikes, reliability (broker persists messages). Kafka for event streaming (ordered, durable, replayable). RabbitMQ for task queues (flexible routing). SQS for simple AWS-native queuing. Choose based on: ordering requirements, throughput needs, and operational complexity tolerance.
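The decoupling and buffering described above can be illustrated with an in-process queue. This is only a sketch: Python's `queue.Queue` stands in for a real broker such as Kafka, RabbitMQ, or SQS, it does not persist messages, and the `worker` function and `None` sentinel are conventions chosen for the example.

```python
import queue
import threading

tasks = queue.Queue(maxsize=100)  # bounded buffer standing in for a broker
processed = []


def worker():
    while True:
        item = tasks.get()
        if item is None:          # sentinel value: stop consuming
            tasks.task_done()
            break
        processed.append(item * 2)  # placeholder for real work
        tasks.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

for n in range(5):
    tasks.put(n)    # the producer does not wait for the work to finish

tasks.put(None)
tasks.join()        # block until every queued item has been handled
consumer.join()
```

With a single consumer the items are handled in the order they were enqueued; adding more consumers increases throughput but gives up that ordering, which is the trade-off behind the "ordering requirements" criterion.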
Part 6: Core System Design Concepts
CAP theorem: distributed systems choose between consistency and availability during partitions. Consistent hashing minimizes key redistribution when nodes change. Database sharding distributes data horizontally. Read replicas scale reads. Rate limiting protects services. Circuit breakers prevent cascading failures. Each concept addresses a specific scalability or reliability challenge.
System Design Concepts Reference
| Concept | Category | Description |
|---|---|---|
| Horizontal Scaling | Scaling | Adding more machines to handle increased load. Requires stateless services, load balancing, and distributed data. Preferred over vertical scaling for web applications. |
| Vertical Scaling | Scaling | Adding more CPU, RAM, or storage to an existing machine. Simpler but has hardware limits. Good for databases that are hard to distribute (PostgreSQL). |
| CDN (Content Delivery Network) | Caching | Geographically distributed cache for static and dynamic content. Reduces latency by serving from edge locations near users. Providers: Cloudflare, AWS CloudFront, Fastly, Akamai. |
| Message Queue | Async | Asynchronous communication between services. Decouples producers and consumers. Handles traffic spikes. Tools: Kafka, RabbitMQ, SQS, NATS. Essential for event-driven architecture. |
| Database Sharding | Data | Distributing data across multiple database instances by a shard key. Enables horizontal scaling of databases. Challenges: cross-shard queries, rebalancing, shard key selection. |
| Read Replicas | Data | Copies of the primary database that serve read queries. Write to primary, read from replicas. Eventual consistency. Simple way to scale read-heavy workloads without sharding. |
| Rate Limiting | Protection | Limiting requests per client per time window. Algorithms: token bucket, sliding window, fixed window. Prevents abuse and protects services. HTTP 429 Too Many Requests. |
| Circuit Breaker | Resilience | Stop calling a failing downstream service. States: Closed (normal), Open (fail fast), Half-Open (trial). Prevents cascading failures. Tools: Resilience4j, Polly, Envoy. |
| Consistent Hashing | Distribution | A hashing scheme that minimizes key redistribution when nodes are added/removed. Used by: distributed caches (for example, Memcached client-side sharding), Cassandra, load balancers, CDNs. On average, only about K/N keys (K = total keys, N = nodes) need to move when adding a node. |
| CAP Theorem | Theory | A distributed system can provide at most two of three guarantees: Consistency (all nodes see the same data), Availability (every request gets a response), Partition tolerance (system works despite network partitions). In practice, you choose CP or AP. |
Figure: Infrastructure Trends (2022-2026). Source: OnlineTools4Free Research.
Part 7: Best Practices
Architecture: start simple, scale when needed. Use CDN for static content. Cache aggressively (Redis). Design stateless services for horizontal scaling. Use message queues for async processing. Database: start with PostgreSQL, add read replicas before sharding. Reliability: implement health checks, circuit breakers, graceful degradation. Operations: monitor with RED/USE methods, set SLOs, maintain error budgets.
Glossary (50 Terms)
Load Balancer
Networking: Distributes incoming network traffic across multiple servers to ensure no single server is overwhelmed. Algorithms: round-robin, least connections, IP hash, weighted. Layer 4 (TCP) or Layer 7 (HTTP). Software: NGINX, HAProxy. Managed: AWS ALB/NLB, GCP LB.
CDN
Caching: Content Delivery Network: geographically distributed servers that cache and serve content from edge locations near users. Reduces latency, offloads origin servers, and protects against DDoS. Static (images, CSS, JS) and dynamic (API acceleration). Providers: Cloudflare, CloudFront, Fastly.
Caching
Performance: Storing frequently accessed data in fast storage (memory) to reduce database load and latency. Levels: browser cache, CDN, application cache (Redis), database query cache. Strategies: cache-aside, write-through, write-behind. Invalidation is usually the hardest part.
Database Sharding
Data: Partitioning data across multiple database instances. Each shard holds a subset of data determined by a shard key. Enables horizontal scaling of databases. Challenges: cross-shard queries, hot spots, rebalancing. Tools: Vitess (MySQL), Citus (PostgreSQL).
Read Replica
Data: A copy of the primary database that serves read queries. Writes go to the primary, replicas sync asynchronously. Eventual consistency (slight lag). Simple scaling for read-heavy workloads. All major databases support replicas.
Message Queue
Messaging: Async communication between services. Producer sends messages, consumer processes them. Decouples services, handles spikes, enables retry. Types: point-to-point (SQS), pub/sub (Kafka). Tools: Kafka, RabbitMQ, SQS, NATS.
Horizontal Scaling
Scaling: Adding more machines to handle increased load. Requires: stateless services, load balancing, distributed data. More resilient than vertical scaling (no single point of failure). Standard approach for web applications.
Vertical Scaling
Scaling: Adding more resources (CPU, RAM) to an existing machine. Simpler but has hardware limits and creates a single point of failure. Good for: databases hard to distribute, initial scaling.
CAP Theorem
Theory: In a distributed system, you can guarantee at most two of: Consistency (all reads return latest write), Availability (every request gets a response), Partition tolerance (system works during network partitions). Since partitions are inevitable, choose CP (consistent but may be unavailable) or AP (available but may return stale data).
Consistent Hashing
Distribution: A hashing scheme where adding/removing a node requires redistributing only about K/N keys on average (K = total keys, N = nodes). Nodes are placed on a hash ring. Keys map to the next node clockwise. Used by: Cassandra, DynamoDB, CDNs, distributed caches. Redis Cluster uses a related but different scheme based on a fixed set of 16,384 hash slots.
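A hash ring of the kind described above can be sketched as follows. This is an illustrative implementation, not the one used by any particular database; the `HashRing` class, the node names, and the choice of 100 virtual nodes per physical node are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Any stable, well-distributed hash works; SHA-256 is used for convenience.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns several positions on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        # First ring position at or after the key's hash, wrapping to the start.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"key-{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}

# Keys whose owner changed; each of them now belongs to the new node.
moved = [k for k in keys if before[k] != after[k]]
```

Going from three to four nodes should move roughly a quarter of the keys, and every moved key lands on the new node. With a naive `hash(key) % N` scheme, most keys would change owner instead.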
Rate Limiting
Protection: Controlling request frequency per client. Algorithms: token bucket (allows burst), sliding window (precise), fixed window (simple). Return HTTP 429. Implement at: API gateway, load balancer, or application. Prevents abuse and protects backend services.
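A token bucket, the burst-tolerant algorithm named above, can be sketched like this. It is a single-process illustration; the `TokenBucket` class and the injectable `clock` parameter (used here so the example does not need to sleep) are assumptions of the sketch.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full, so a burst is allowed
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False   # the caller would typically respond with HTTP 429


fake_now = [0.0]   # manual clock for a deterministic demonstration
bucket = TokenBucket(rate_per_sec=1, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(4)]       # three allowed, fourth rejected
fake_now[0] += 2.0                               # two seconds pass, two tokens refill
after_wait = [bucket.allow() for _ in range(3)]  # two allowed, third rejected
```

In a multi-instance service the token count would have to live in shared storage, which this sketch leaves out.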
Circuit Breaker
Resilience: Stops calling failing services to prevent cascading failures. Closed (normal) -> Open (fail fast) -> Half-Open (trial). Trips when failure rate exceeds threshold. Combine with retry, timeout, and fallback patterns.
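The three states can be sketched as a small state machine. Note one simplification: this version trips on a count of consecutive failures rather than on a failure rate, which keeps the example short. The `CircuitBreaker` and `CircuitOpenError` names, the thresholds, and the injectable `clock` are all assumptions of the sketch, not the API of Resilience4j, Polly, or Envoy.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"   # timeout elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self.state = "closed"
        return result


fake_now = [0.0]   # manual clock for a deterministic demonstration
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: fake_now[0])


def flaky():
    raise ConnectionError("downstream unavailable")


def healthy():
    return "ok"


states = []
for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
    states.append(breaker.state)

try:
    breaker.call(healthy)      # rejected without calling the function
    failed_fast = False
except CircuitOpenError:
    failed_fast = True

fake_now[0] += 10.0                 # the reset timeout elapses
recovered = breaker.call(healthy)   # trial call succeeds and closes the breaker
```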
ACID
Data: Database transaction properties: Atomicity (all or nothing), Consistency (valid state transitions), Isolation (concurrent transactions do not interfere), Durability (committed data survives crashes). Standard for relational databases. NoSQL databases often relax ACID for scalability.
BASE
Data: Alternative to ACID for distributed systems: Basically Available (system is available), Soft state (state may change without input), Eventually consistent (system converges to consistent state). Used by NoSQL databases, event-driven systems.
Reverse Proxy
Networking: A server that sits in front of web servers and forwards client requests. Provides: load balancing, SSL termination, caching, compression, and security (hiding origin servers). NGINX, HAProxy, Caddy, Traefik. Most production deployments benefit from one.
DNS
Networking: Domain Name System: translates domain names to IP addresses. DNS-based load balancing distributes traffic geographically. TTL controls cache duration. DNS failover for disaster recovery. Services: Route 53, Cloudflare DNS, Google Cloud DNS.
API Gateway
Networking: Single entry point for API requests. Handles: routing, auth, rate limiting, transformation, caching. Decouples clients from backend topology. Tools: Kong, AWS API Gateway, Traefik, Envoy.
Idempotency
Design: An operation producing the same result regardless of repetition count. GET, PUT, DELETE are naturally idempotent. POST is not. Use idempotency keys for non-idempotent operations to prevent duplicates.
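Idempotency keys can be sketched as a lookup performed before the side effect. This is a hypothetical example: `PaymentService` and its `charge` method are invented for illustration, and the in-memory dict stands in for a durable store that a real service would need.

```python
import uuid


class PaymentService:
    """Hypothetical service; the dict stands in for a durable key store."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = []    # side effects actually performed

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._results:
            # Replay of an earlier request: return the saved response
            # without charging a second time.
            return self._results[idempotency_key]
        charge_id = str(uuid.uuid4())
        self.charges.append((charge_id, amount_cents))
        response = {"charge_id": charge_id, "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


svc = PaymentService()
key = str(uuid.uuid4())          # generated once by the client per logical operation
first = svc.charge(key, 1999)
retry = svc.charge(key, 1999)    # for example, a retry after a network timeout
```

The client generates the key once and reuses it on every retry, so a POST that timed out can be repeated safely.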
Eventual Consistency
Consistency: A consistency model where updates propagate asynchronously. The system converges to a consistent state over time. Standard in distributed systems, NoSQL databases, and event-driven architectures. Trade-off: higher availability at the cost of temporary staleness.
Strong Consistency
Consistency: Every read returns the most recent write. Requires synchronous replication or consensus (Raft, Paxos). Lower availability during partitions. Used by: relational databases, distributed consensus systems (etcd, ZooKeeper).
Partitioning
Data: Dividing data across multiple nodes. Horizontal (sharding): different rows on different nodes. Vertical: different columns on different nodes. Range-based: partition by key range. Hash-based: partition by hash of key. Enables horizontal scaling.
Replication
Data: Copying data across multiple nodes for redundancy and read scaling. Single-leader: one primary handles writes, replicas sync. Multi-leader: multiple nodes accept writes (conflict resolution needed). Leaderless: any node accepts reads/writes (quorum-based).
Consensus
Distributed Systems: Agreement among distributed nodes on a single value. Algorithms: Raft (etcd, Consul), Paxos (Google Spanner), Zab (ZooKeeper). Used for: leader election, distributed locks, configuration management. Requires majority of nodes (quorum) to agree.
Bloom Filter
Data Structure: A probabilistic data structure that tests whether an element is in a set. False positives possible, false negatives impossible. Very space-efficient. Used by: databases (check if key exists before disk read), CDNs, spam filters, web crawlers.
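A Bloom filter can be sketched with a bit array and a few hash positions per item. The sizes below (1,024 bits, 3 hashes) are arbitrary values chosen for the example, and deriving the hash functions by prefixing a seed to SHA-256 input is one simple approach among several.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter()
for word in ["alpha", "beta", "gamma"]:
    bf.add(word)

known = [bf.might_contain(w) for w in ["alpha", "beta", "gamma"]]  # always all True
```

Items that were added always test positive. An item that was never added usually tests negative, but a false positive is possible, which is why the method is named `might_contain`.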
Write-Ahead Log (WAL)
Data: A log where all changes are written before being applied to the database. Enables crash recovery (replay log after crash), replication (send log to replicas), and CDC (read log for change events). Used by PostgreSQL, MySQL, Kafka.
Leader Election
Distributed Systems: The process of choosing one node as the leader (primary) in a distributed system. The leader handles writes or coordination. If the leader fails, a new election occurs. Algorithms: Raft, Paxos, Bully. Tools: etcd, ZooKeeper, Consul.
Back Pressure
Resilience: A flow control mechanism where a system signals upstream that it cannot handle more load. Prevents overwhelming downstream services. Implementation: reject requests (429), queue and process slowly, reduce producer rate. Essential for streaming systems.
Data Lake
Data: A centralized repository for structured and unstructured data at any scale. Store raw data and transform on read (schema-on-read). Technologies: S3, HDFS, Delta Lake, Apache Iceberg. Used for: analytics, ML training, data science exploration.
Data Warehouse
Data: A system optimized for analytical queries on structured data. Schema-on-write, columnar storage, pre-aggregated. Technologies: Snowflake, BigQuery, Redshift, ClickHouse. Used for: business intelligence, reporting, dashboards.
Event Sourcing
Pattern: Storing state changes as immutable events rather than current state. Current state derived by replaying events. Benefits: full audit trail, temporal queries, event replay. Challenges: schema evolution, snapshot management.
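Deriving state by replaying events can be sketched with a small account ledger. The `Event` type, the event kinds (`deposited`, `withdrawn`), and the `replay` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)   # events are immutable once recorded
class Event:
    kind: str     # "deposited" or "withdrawn"
    amount: int


def apply(balance, event):
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, initial=0):
    balance = initial
    for event in events:
        balance = apply(balance, event)
    return balance


log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]

current = replay(log)            # state after all events
as_of_second = replay(log[:2])   # temporal query: state after the first two events
```

Replaying a prefix of the log is what makes temporal queries possible. For long logs, a snapshot can be passed as `initial` so that only the events after it need to be replayed.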
CQRS
Pattern: Command Query Responsibility Segregation: separate models for reads and writes. Write model handles commands, read model optimized for queries. Can use different databases. Benefits: independent scaling, optimized data models.
Service Discovery
Infrastructure: Mechanism for services to find each other in dynamic environments. Client-side: query registry (Consul, Eureka). Server-side: load balancer routes (K8s Service). DNS-based: service names resolve to IPs.
Observability
Operations: Understanding system state from external outputs. Three pillars: metrics (Prometheus), logs (ELK/Loki), traces (Jaeger/Tempo). OpenTelemetry unifies instrumentation. Essential for operating distributed systems.
SLO/SLI/SLA
Operations: SLI (Service Level Indicator): measurable metric (latency, availability). SLO (Service Level Objective): target for SLI (99.9% availability). SLA (Service Level Agreement): contractual commitment with consequences. Error budget = 1 - SLO.
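The error budget formula translates directly into allowed downtime. The sketch below assumes an availability SLO measured over a 30-day window; the window length and function names are choices made for the example.

```python
def error_budget_fraction(slo):
    """Error budget = 1 - SLO, expressed as a fraction of the window."""
    return 1 - slo


def allowed_downtime_minutes(slo, window_days=30):
    return error_budget_fraction(slo) * window_days * 24 * 60


budget_three_nines = allowed_downtime_minutes(0.999)    # about 43.2 minutes
budget_four_nines = allowed_downtime_minutes(0.9999)    # about 4.3 minutes
```

Each additional nine cuts the budget by a factor of ten, which is a useful check on whether a proposed SLO is realistic for the team operating the service.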
Twelve-Factor App
Methodology: A methodology for building SaaS applications. Principles: codebase in version control, explicit dependencies, config in environment, backing services as resources, build/release/run separation, stateless processes, port binding, concurrency via processes, disposability, dev/prod parity, logs as event streams, admin processes.
Blue-Green Deployment
Deployment: Two identical environments. Deploy to idle, test, switch traffic. Instant rollback by switching back. Requires double infrastructure during transition.
Canary Deployment
Deployment: Deploy new version to small traffic subset (5%), monitor metrics, gradually increase. Catches issues under real traffic. Tools: Istio, Argo Rollouts, Flagger.
Graceful Degradation
Resilience: System continues to function with reduced capability when components fail. Example: show cached product catalog when catalog service is down. Provide fallback responses, disable non-essential features, prioritize core functionality.
Thundering Herd
Problem: Many clients simultaneously requesting the same resource after cache expiration or service recovery. Causes: cache stampede, service restart. Prevention: jitter on cache TTL, request coalescing, circuit breaker, staggered retry with backoff.
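Jitter on cache TTLs, the first prevention technique listed above, can be sketched as follows. The 300-second base TTL and the 10% spread are arbitrary values for the example, and `jittered_ttl` is a hypothetical helper.

```python
import random


def jittered_ttl(base_seconds, spread=0.1, rng=random):
    """Return the base TTL varied randomly by up to +/- spread (a fraction)."""
    return base_seconds * (1 + rng.uniform(-spread, spread))


rng = random.Random(42)   # seeded only to make the demonstration repeatable
ttls = [jittered_ttl(300, spread=0.1, rng=rng) for _ in range(1000)]
```

Entries written at the same moment then expire across a window of about a minute (270 to 330 seconds) instead of all at once, so the origin sees a spread of reloads rather than a single spike.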
Hot Spot
Problem: A node or partition receiving disproportionately more traffic than others. Causes: poor shard key selection (celebrity user, popular product), time-based partitioning during peak hours. Prevention: random suffix on keys, pre-splitting, separate hot data.
Webhook
Integration: Server pushes notifications to a client URL when events occur. Client registers callback URL; server sends POST on event. Used for: payment confirmations, CI/CD triggers, integrations. Must handle: retries, signature verification, idempotency.
WebSocket
Protocol: Full-duplex communication over a single TCP connection. Bidirectional real-time data. Used for: chat, live dashboards, gaming, collaborative editing. Alternative: SSE (Server-Sent Events) for server-to-client only.
gRPC
Protocol: High-performance RPC framework using Protocol Buffers and HTTP/2. Features: bidirectional streaming, code generation, strong typing, smaller payloads than JSON. Used for: internal microservice communication, mobile backends.
Object Storage
Storage: Storage for unstructured data (files, images, videos) with flat namespace and HTTP API. Services: AWS S3, GCS, Azure Blob, MinIO (self-hosted). Features: durability (11 nines), versioning, lifecycle policies, event notifications. Standard for storing user uploads and static assets.
Connection Pooling
Performance: Maintaining a pool of reusable database connections. Avoids the overhead of creating/destroying connections per request. Tools: PgBouncer (PostgreSQL), ProxySQL (MySQL), application-level pools (HikariCP). Essential for performance at scale.
Geo-Replication
Data: Replicating data across multiple geographic regions. Benefits: low latency for global users, disaster recovery. Challenges: cross-region consistency, compliance (data residency). Services: CockroachDB, Spanner, DynamoDB Global Tables, Cosmos DB.
Backpressure
Resilience: Flow control mechanism where a system signals upstream that it cannot handle more load. Prevents overwhelming downstream. Implementations: reject with 429, bounded queues, reactive streams (Project Reactor, RxJS).
Content Negotiation
HTTP: HTTP mechanism for selecting response format. Client sends Accept header, server responds in requested format. Used for: JSON vs XML, language selection, API versioning via media types.
Health Check
Operations: Endpoint (/health, /ready) reporting service status. Liveness: is the process running? Readiness: can it handle traffic? Used by load balancers and orchestrators for routing and restart decisions.
Distributed Lock
Coordination: A lock mechanism that works across multiple machines. Ensures only one process can access a resource at a time. Implementations: Redis (Redlock), ZooKeeper, etcd. Use sparingly: prefer idempotent designs over distributed locks.
