What Each Building Block in a System Actually Costs You
system-design · architecture · backend


Load balancers, databases, caches, queues, CDNs. Every system design resource lists them as a menu you pick from. None of them tell you what each one adds to your operational surface. I am working through each building block not to memorize what it does, but to understand when it earns a place in a design and what you are signing up to maintain once it is there.

March 24, 2026 · 13 min read

What does a load balancer actually give you, and what does it take?

A load balancer sits in front of your application servers and distributes incoming traffic across them. The obvious benefit is that you can run multiple instances of your application and spread the load. The less obvious cost is that you now have a component that must itself be highly available, must route requests consistently when sessions matter, and must be reconfigured every time you scale up or down.

There are two layers where load balancers appear:

  • Global load balancers operate at the DNS level and route traffic across data centers or regions. Cloudflare, AWS Route 53 with latency routing, and GCP's global load balancer work at this layer. They handle geographic distribution and can route users to the nearest healthy region.
  • Local load balancers operate within a data center and distribute traffic across application server instances. They handle health checks, connection draining, and sticky sessions. AWS ALB and NGINX are common here.

The routing algorithms matter when your servers are not interchangeable. Round robin sends each request to the next server in sequence. Least connections sends each request to the server with the fewest active connections, which is better when requests vary significantly in duration. IP hash pins a client to a server based on their IP, which is useful for session affinity but breaks down behind NAT.
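
A minimal sketch of the difference between the first two strategies, assuming an in-memory server list and a hypothetical active_connections counter that a real balancer would derive from live connection state:

```python
# Round robin vs. least connections, sketched in process memory.
# Server and its connection count are stand-ins for real balancer state.
import itertools
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    active_connections: int = 0

servers = [Server("app-1"), Server("app-2"), Server("app-3")]

# Round robin: hand each request to the next server, regardless of load.
rotation = itertools.cycle(servers)

def pick_round_robin() -> Server:
    return next(rotation)

# Least connections: hand each request to the server with the fewest
# in-flight requests; better when request durations vary widely.
def pick_least_connections() -> Server:
    return min(servers, key=lambda s: s.active_connections)
```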

What you are signing up for: health check configuration, certificate management, connection limits, and the reality that a misconfigured load balancer can become the bottleneck faster than the application servers behind it.


What is the right database, and what does the choice foreclose?

The choice between relational and NoSQL databases is not about which is newer. It is about which consistency and query model matches the data.

Relational databases (PostgreSQL, MySQL) give you ACID transactions by default. A transaction either commits fully or rolls back entirely. Foreign keys enforce referential integrity at the database level. Joins let you query across related tables without denormalizing. These properties are not free: write throughput is lower than that of most NoSQL systems at equivalent consistency guarantees, and horizontal scaling requires careful schema design or application-level sharding.
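
A sketch of what the all-or-nothing property looks like from application code, using psycopg2 against PostgreSQL; the connection string, the accounts table, and the column names are hypothetical:

```python
# Transfer funds between two rows: both UPDATEs commit together or neither does.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN

try:
    with conn:  # the connection context manager commits on success, rolls back on exception
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (100, 1),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (100, 2),
            )
    # At this point other sessions see both updates, or neither.
finally:
    conn.close()
```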

NoSQL systems trade some of those guarantees for different properties:

  • Document stores (MongoDB, DynamoDB) store data as self-contained documents. Reads are fast when all the data you need is in one document. Cross-document queries are expensive or impossible, which pushes you toward denormalization. Schemas are flexible but that flexibility moves validation responsibility into the application.
  • Key-value stores (Redis, DynamoDB in key-value mode) offer the fastest read and write performance because the access pattern is simple: fetch by key. There is no query language, no joins, no filtering by value. They work well when your access pattern is known in advance; a minimal sketch of this access pattern follows the list.
  • Wide-column stores (Cassandra, HBase) are optimized for high write throughput and time-series data. Reads are fast only if the query matches the table's partition key and sort key design. Schema design in Cassandra is query-first: you design the table around the query, not the data.
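
The key-value access pattern from the list above, sketched with boto3 against DynamoDB; the table name and the user_id partition key are made up, and the table is assumed to already exist:

```python
# Fetch-by-key access: no query language, no joins, just a put and a get.
import boto3

table = boto3.resource("dynamodb").Table("user_profiles")  # hypothetical table

table.put_item(Item={"user_id": "u-123", "display_name": "Ada", "plan": "pro"})

resp = table.get_item(Key={"user_id": "u-123"})
profile = resp.get("Item")  # None if the key does not exist
```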

The mistake teams make is choosing a database for its write throughput when the application's real bottleneck is read patterns, or choosing NoSQL for flexibility and then spending months reimplementing joins in application code.

The question worth asking before choosing: what does the worst-case query look like, and can the database execute it without a full scan?


What is a cache actually solving, and when does it stop helping?

A cache stores the result of an expensive operation so that subsequent requests for the same data can be served without repeating the work. The expensive operation is usually a database read, but it can be an API call, a computation, or a rendered page.

The three caching strategies differ in when the cache is populated and what happens on a write:

  • Cache-aside (lazy loading): the application checks the cache first. On a miss, it reads from the database and writes the result into the cache. The cache is only populated for data that is actually requested. Stale data is possible if the database is updated without invalidating the cache. A minimal sketch of this flow follows the list.
  • Write-through: every write goes to both the cache and the database synchronously. The cache is always consistent with the database, but writes are slower because they must complete in both places.
  • Write-back (write-behind): writes go to the cache first, and the database is updated asynchronously. Write throughput is high, but there is a window where data in the database is stale. If the cache node fails before the flush, data is lost.
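
A minimal cache-aside sketch using redis-py; fetch_user_from_db, the key format, and the TTL are placeholders rather than a prescription:

```python
# Cache-aside: check the cache, fall back to the database, populate on miss.
import json
import redis

r = redis.Redis()
CACHE_TTL_SECONDS = 300  # bounds how long stale data can survive

def fetch_user_from_db(user_id: str) -> dict:
    ...  # placeholder for the real database read

def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)           # hit: skip the database
    user = fetch_user_from_db(user_id)      # miss: do the expensive read
    r.set(key, json.dumps(user), ex=CACHE_TTL_SECONDS)
    return user

def update_user(user_id: str, fields: dict) -> None:
    ...  # write to the database first
    r.delete(f"user:{user_id}")  # then invalidate so the next read repopulates
```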

Redis supports more than key-value caching: sorted sets for leaderboards, pub/sub for lightweight messaging, and Lua scripting for atomic operations. Memcached is simpler and faster for pure caching workloads but offers none of those extras.
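
As a sketch of those extras, a leaderboard on a sorted set with redis-py; the key name and scoring are arbitrary:

```python
# Sorted set leaderboard: Redis keeps members ordered by score.
import redis

r = redis.Redis()

def record_score(player_id: str, points: float) -> None:
    r.zincrby("leaderboard", points, player_id)  # add points to the player's total

def top_players(n: int = 10):
    # Highest scores first, returned with their scores.
    return r.zrevrange("leaderboard", 0, n - 1, withscores=True)
```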

The part that is easy to underestimate: cache invalidation. When the underlying data changes, the cache must either be invalidated (forcing the next read to repopulate from the database) or updated. Getting this wrong means users see stale data, or worse, the cache and database diverge silently. Cache invalidation errors are responsible for a disproportionate share of subtle production bugs.

A cache does not reduce the total work the system does. It changes who does it and when. If the cache hit rate is low, you are doing the database read and the cache write on every request, which is strictly more expensive than just reading from the database.


What does a message queue change about how a system handles load?

A message queue decouples the component that produces work from the component that does it. The producer writes a message to the queue and continues. The consumer reads from the queue and processes at its own pace. The queue absorbs bursts: if the producer generates 10,000 messages in a minute and the consumer can handle 1,000 per minute, the queue grows by 9,000 messages that minute and shrinks as the consumer catches up.

The two dominant models:

Message queues (RabbitMQ, AWS SQS): each message is consumed by exactly one consumer. The queue tracks which messages have been acknowledged and retries unacknowledged messages. Good for work distribution: send an email, resize an image, process a payment. Once a consumer claims a message, no other consumer sees it.
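
A sketch of that claim-and-acknowledge cycle with boto3 and SQS; the queue URL and the process function are placeholders:

```python
# One message, one consumer: claim, process, then delete to acknowledge.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-resize"  # placeholder

def process(body: str) -> None:
    ...  # the actual work, e.g. resize an image

# Producer: enqueue and move on.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody='{"image_id": "img-42"}')

# Consumer: long-poll for up to 10 messages, acknowledge each by deleting it.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    # Messages that are never deleted reappear after the visibility timeout.
```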

Event streaming (Kafka): messages are written to a log and retained for a configurable period. Consumers track their own position in the log and can replay messages. Multiple consumers can read the same message independently. Good for event sourcing, audit logs, or feeding multiple downstream systems from the same event stream.

The operational cost of a queue is that you have now introduced a dependency that must be monitored and sized. Queue depth is the metric to watch: if it grows without recovering, your consumer is too slow for the producer's rate and you need more consumer capacity or a rate limit on the producer. A queue that grows unbounded will eventually apply backpressure to producers or run out of storage.

Idempotency matters here. If a consumer crashes after processing a message but before acknowledging it, the queue will redeliver the message. If the operation is not idempotent, the work will be done twice. Teams discover this in production rather than at design time more often than they should.
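
One common mitigation is to record each message ID and skip duplicates, sketched here with redis-py; the key prefix, TTL, and charge_payment are illustrative, and a stricter design would commit the seen-marker and the side effect atomically:

```python
# Deduplicate redeliveries by claiming the message ID before doing the work.
import redis

r = redis.Redis()

def handle(message_id: str, body: str) -> None:
    # SET NX succeeds only for the first delivery of this ID.
    if not r.set(f"seen:{message_id}", 1, nx=True, ex=7 * 24 * 3600):
        return  # duplicate delivery: skip and acknowledge
    charge_payment(body)  # the non-idempotent operation being protected

def charge_payment(body: str) -> None:
    ...  # placeholder for the real side effect
```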


What does a CDN actually serve, and what does it not solve?

A CDN (content delivery network) caches your static assets at edge nodes distributed geographically. When a user in Jakarta requests an image hosted in us-east-1, without a CDN they wait for a round trip to the US data center. With a CDN, the edge node in Singapore serves the cached image and the latency is an order of magnitude lower.

CDNs are effective for:

  • Static assets: images, CSS, JavaScript bundles, fonts
  • Streaming video: HLS segments cached at the edge
  • API responses that are identical for all users and change infrequently

CDNs are not effective for:

  • Personalized responses that differ per user
  • Data that changes frequently enough that the cache TTL has to be very short
  • Reducing latency for write operations, which must still reach the origin

Cache invalidation at the CDN layer is slower and less granular than at the application layer. Most CDNs let you purge by URL or by tag, but pushing an invalidation to every edge node takes seconds to minutes. For content that must update immediately (breaking news, real-time scores), a CDN cache in front of it will cause stale content to surface unless TTLs are very short or you use a CDN that supports instant purge.
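
One place the TTL trade-off shows up is in the Cache-Control headers the origin attaches; the paths and numbers here are illustrative, not a recommendation:

```python
# Different content classes get different edge-cache lifetimes.
def cache_headers(path: str) -> dict:
    if path.startswith("/static/"):
        # Fingerprinted assets never change in place: cache aggressively.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.startswith("/scores/"):
        # Near-real-time content: a very short edge TTL, or bypass the CDN entirely.
        return {"Cache-Control": "public, max-age=5"}
    # Personalized responses: do not cache at the edge at all.
    return {"Cache-Control": "private, no-store"}
```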

Cloudflare and Akamai sit at the larger end of the market. AWS CloudFront integrates tightly with S3 and ALB, which is convenient if your infrastructure is already on AWS.


What does the rest of the infrastructure stack add?

Beyond the core components, a system needs several services that are easy to underestimate because they are not in the data path of a typical request.

Rate limiters sit in front of APIs and limit how many requests a client can make in a time window. The two common implementations are token bucket (a client accumulates tokens over time and spends one per request, allowing bursts) and fixed window counter (count requests per window, reject above the limit, simpler but susceptible to boundary bursts). Without a rate limiter, a misbehaving client or a DDoS attack can consume all available capacity. With one, you need to decide where the state lives: in memory on each application server (cheap, but rate limits are per-server rather than global) or in a shared Redis instance (globally accurate, but a dependency on Redis for every request).
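
A minimal in-process token bucket along those lines; the capacity and refill rate are illustrative, and because the state lives in process memory the limit is per-server, as noted above:

```python
# Token bucket: tokens accumulate over time up to a cap, each request spends one.
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Accrue tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(capacity=10, refill_per_second=5)  # bursts of 10, 5 req/s sustained
```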

Unique ID generators matter when you need IDs that are globally unique across distributed writes without a central coordinator. Auto-incrementing database IDs work on a single node but fail under sharding. UUIDs are universally unique but random, which creates index fragmentation on insertion. Twitter's Snowflake and similar approaches generate time-ordered, sortable IDs that embed a timestamp, a worker ID, and a sequence number. The ordering property is useful for pagination and debugging.
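
A sketch of the Snowflake layout: 41 bits of milliseconds since a custom epoch, 10 bits of worker ID, and 12 bits of per-millisecond sequence. The epoch constant below is the commonly cited Twitter value, and clock-skew and sequence-rollover handling are omitted:

```python
# Snowflake-style ID: timestamp | worker | sequence, packed into one integer.
import time

EPOCH_MS = 1288834974657  # commonly cited Twitter epoch; any fixed past timestamp works

def snowflake_id(worker_id: int, sequence: int) -> int:
    now_ms = int(time.time() * 1000) - EPOCH_MS
    return (now_ms << 22) | ((worker_id & 0x3FF) << 12) | (sequence & 0xFFF)

# IDs generated later compare higher, which is what makes them sortable
# for pagination and easy to line up against logs when debugging.
```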

Search systems (Elasticsearch, OpenSearch) are separate from databases because the query model is fundamentally different. Full-text search requires inverted indexes, relevance scoring, and tokenization pipelines that relational databases handle poorly at scale. The cost is eventual consistency with the primary data store: documents are indexed asynchronously, so a write to the database does not appear in search results immediately.
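
A sketch of that asynchronous indexing step, assuming an 8.x elasticsearch-py client and a background worker that runs after the primary write has committed; the index name and document shape are made up:

```python
# The database row already exists; until this runs, search does not see it.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def index_product(product: dict) -> None:
    es.index(index="products", id=product["id"], document=product)
```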

Logging and monitoring are the components that make the rest of the system observable. Without structured logs, debugging a distributed system means guessing which of thirty machines handled a specific request. Without metrics, you cannot see queue depth rising, cache hit rate falling, or error rate climbing until a user reports it. Teams consistently underprice the operational value of good observability until they are debugging a production incident at 2 a.m.

The tools I keep seeing referenced: Prometheus and Grafana for metrics, the ELK stack (Elasticsearch, Logstash, Kibana) or Datadog for logs, and distributed tracing (Jaeger, OpenTelemetry) for correlating a single request across multiple services.


What does this layer actually cost to operate?

Each building block added to a system is a service that must be deployed, configured, monitored, scaled, and recovered from failure. A system with a database, a cache, a queue, a CDN, a rate limiter, and a search index has six dependencies beyond the application itself. Any of them can fail. All of them need to be sized correctly. Most of them have operational subtleties that only surface under load.

The pattern in post-mortems is not that the individual components failed in unexpected ways. It is that the assumptions between them were wrong: the cache was invalidated more slowly than the write rate required, the queue depth grew faster than the consumer scaling policy could react, and the search index fell behind the primary database so queries returned stale results during a high-write period.

The question I am trying to hold while working through each building block is not "what does this component do?" but "what does adding this component assume about everything else in the system?" That is the question that determines whether a design ages well or surprises you in production.

system-design · architecture · backend