Rate Limiting Is Not Optional and Most Implementations Are Wrong
Rate limiting is one of the few API design decisions where the failure mode is existential rather than merely inconvenient. An API without rate limiting is an API that can be brought down by a single misbehaving consumer, whether that consumer is a customer with a buggy retry loop, a competitor running a data extraction operation, or an attacker attempting a denial of service. The argument for implementing rate limiting is not about fairness or monetization tiers, though it serves both. It is about operational survival.
The implementation errors that make rate limiting ineffective are consistent enough across organizations to describe as a failure pattern. Teams implement rate limits they have not tested under load. They apply limits at the wrong granularity. They configure error responses that consumers cannot interpret correctly. They fail to account for distributed systems behavior in their counting logic. The rate limiting exists on paper. The protection it provides is incomplete.
Algorithm Choice and Its Consequences
The token bucket algorithm — where each consumer has a bucket that fills at a constant rate and each request consumes a token — is the most commonly deployed rate limiting approach and the most forgiving for bursty legitimate traffic. A consumer that makes no requests for ten minutes accumulates tokens during that period and can spend them in a brief burst without being throttled. This matches the actual usage patterns of most API consumers better than algorithms that enforce strict per-second limits without allowance for bursts.
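A minimal in-memory sketch of the token bucket just described; class and parameter names are illustrative, not a production implementation:

```python
import time

class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity`; one token per request."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # refill rate, tokens per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Accrue tokens for the idle time since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The cap at `capacity` is what makes the algorithm forgiving: an idle consumer's bucket fills back up, permitting the brief burst after ten quiet minutes without ever exceeding the long-run rate.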
The fixed window algorithm — counting requests within a fixed time window and resetting the counter at the window boundary — has a known failure mode: a consumer that maximizes requests at the end of one window and the beginning of the next can effectively double the nominal rate limit for a brief period. The sliding window algorithm addresses this but requires storing more state per consumer, which becomes significant at scale.
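The boundary failure mode is easy to demonstrate in a sketch of the fixed window counter (the explicit `now` parameter is for illustration; a real implementation would read the clock):

```python
from collections import defaultdict

class FixedWindowLimiter:
    """At most `limit` requests per `window`-second window. The counter resets
    at each window boundary, which is what permits the 2x boundary burst."""

    def __init__(self, limit: int, window: int):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str, now: float) -> bool:
        bucket = (key, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True
```

With a limit of 2 per 10-second window, requests at t=9.8, 9.9, 10.1, and 10.2 are all allowed: four requests in under half a second, double the nominal rate, because the last two land in a fresh window.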
The choice of algorithm affects how the rate limit behaves under legitimate high-traffic conditions. A system that throttles valid consumers during traffic spikes because the rate limiting algorithm does not account for burst patterns will generate support tickets and erode consumer trust. Testing the rate limiting behavior under realistic traffic patterns before launch is not optional.
Granularity Errors
Rate limiting at the wrong granularity is as bad as not rate limiting. An API that applies rate limits per account but not per endpoint can be abused through a single expensive endpoint even when the account-level limit appears reasonable. An API that applies limits per IP address without considering shared NAT environments — corporate offices and mobile networks where many users share a single IP — will throttle legitimate users based on the behavior of others behind the same address.
The correct granularity depends on the threat model and the consumer population. Public APIs consumed by individual developers should limit per API key. APIs consumed by enterprise accounts should limit per account with per-endpoint sub-limits for expensive operations. APIs that authenticate users within the requesting application should limit per authenticated user, not per application key.
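One way to compose these granularities is to check an account-level budget alongside per-endpoint sub-limits for expensive operations, consuming both only when both have room. The limits, names, and fixed-window counting here are illustrative:

```python
from collections import defaultdict

class LayeredLimiter:
    """Fixed-window account-level limit plus optional per-endpoint sub-limits."""

    def __init__(self, account_limit: int, endpoint_limits: dict, window: int = 60):
        self.account_limit = account_limit
        self.endpoint_limits = endpoint_limits  # endpoint -> sub-limit
        self.window = window
        self.counts = defaultdict(int)

    def allow(self, account: str, endpoint: str, now: float) -> bool:
        w = int(now // self.window)
        acct_key = (account, w)
        ep_key = (account, endpoint, w)
        ep_limit = self.endpoint_limits.get(endpoint)
        # Both budgets must have room before either counter is consumed.
        if self.counts[acct_key] >= self.account_limit:
            return False
        if ep_limit is not None and self.counts[ep_key] >= ep_limit:
            return False
        self.counts[acct_key] += 1
        self.counts[ep_key] += 1
        return True
```

An account that exhausts its sub-limit on a hypothetical expensive `/export` endpoint is still free to call cheap endpoints against its remaining account budget.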
Error Response Quality
The HTTP 429 status code — Too Many Requests — is the correct response when a rate limit is exceeded. The Retry-After header should accompany every 429 response, specifying either the number of seconds until the consumer can retry or the timestamp at which the limit resets. A 429 without retry information forces the consumer to implement exponential backoff against an unknown reset time, producing retry storms that increase the load on an already-throttled system.
The X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers — not standardized but widely adopted — allow well-behaved consumers to manage their request rates proactively rather than reactively. Consumers that can see their remaining request budget do not need to discover the limit by hitting it. Surfacing this information costs nothing and substantially improves the consumer experience.
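Both header families can be assembled in one place; a sketch, with an illustrative function signature (`reset_epoch` is the Unix timestamp at which the window resets):

```python
def rate_limit_headers(limit: int, remaining: int, reset_epoch: int, now: float) -> dict:
    """Headers for every response; Retry-After is added only when the limit is exhausted."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        # Accompanies the 429 so consumers back off for a known interval
        # instead of guessing at the reset time.
        headers["Retry-After"] = str(max(0, int(reset_epoch - now)))
    return headers
```

Sending the X-RateLimit-* trio on every response, not just the 429, is what enables the proactive budgeting described above.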
Distributed Counting
Rate limiting in a distributed system — multiple API server instances behind a load balancer — requires a shared counting mechanism. Each server counting requests independently will enforce the rate limit per instance rather than per consumer globally, effectively multiplying the nominal limit by the number of server instances.
Redis is the standard shared counter for distributed rate limiting. The implementation must handle the race condition between checking the count and incrementing it — a Redis Lua script, or INCR and EXPIRE wrapped together in a MULTI/EXEC transaction, provides the atomicity required; issuing INCR and EXPIRE as separate commands leaves a window in which the expiry may never be set. The latency of the Redis lookup is the cost of correctness. For most APIs this cost is acceptable. For latency-sensitive APIs where the rate limiting check is on the critical path, co-locating the Redis instance and tuning the connection pool are the appropriate optimizations.
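One common shape for the atomic check is a Lua script that increments the window counter and sets its TTL in a single Redis round trip. The script and the redis-py invocation in the comments are a sketch under the assumption of a reachable Redis instance; the key format is illustrative:

```python
# Executed atomically by Redis: INCR the window counter, set its TTL on
# first use, and report whether the limit has been exceeded.
RATE_LIMIT_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
if current > tonumber(ARGV[2]) then
  return 0
end
return 1
"""

# With redis-py (assumes a running Redis server):
#   check = client.register_script(RATE_LIMIT_LUA)
#   allowed = check(keys=[f"rl:{api_key}:{window}"], args=[window_seconds, limit]) == 1

def script_logic(counts: dict, key: str, limit: int) -> bool:
    """Pure-Python mirror of the script's increment-then-compare logic."""
    counts[key] = counts.get(key, 0) + 1
    return counts[key] <= limit
```

Because the whole script runs as one atomic operation on the Redis server, no two API instances can observe the same count and both admit a request past the limit.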
Rate limiting done correctly is invisible to well-behaved consumers and a hard wall for misbehaving ones. Done incorrectly, it creates friction for legitimate use while providing inadequate protection against abuse. The implementation deserves the same rigor as the API it protects.