UseToolSuite
Web Security

API Rate Limiting: How It Works and How to Handle It

Understand API rate limiting from both sides — implementing it as a backend developer and handling it gracefully as a consumer. Covers token bucket, sliding window, retry strategies, and common mistakes.

Necmeddin Cunedioglu


A few months ago, a deploy script I wrote hammered a third-party API with 500 requests in 10 seconds. The API returned 429 Too Many Requests on every request after the first 100, and the script’s error handling was catch(e) { retry() } — which made things exponentially worse. Within 30 seconds, the script had been IP-banned for 24 hours right before a critical deployment window.

Rate limiting is the API world’s traffic management system. Whether you’re building an API or consuming one, understanding how it works saves you from outages, bans, and very stressful on-call incidents.

What Rate Limiting Actually Does

Rate limiting restricts how many API requests a client can make within a time window. If you exceed the limit, the server responds with HTTP 429 Too Many Requests instead of processing your request.

From the server side, rate limiting serves three purposes:

  1. Prevents abuse — stops malicious scripts from overwhelming the API
  2. Ensures fair usage — one heavy user shouldn’t degrade performance for everyone else
  3. Controls infrastructure costs — every API call costs compute, memory, and bandwidth

The Common Rate Limit Algorithms

Token Bucket

The token bucket is the most widely used algorithm because it allows controlled bursts while enforcing an average rate.

Imagine a bucket that holds 100 tokens. Each API request takes one token. Tokens are added back at a steady rate (say, 10 per second). When the bucket is empty, requests are rejected until more tokens accumulate.

Bucket capacity: 100 tokens
Refill rate: 10 tokens/second

→ You can burst 100 requests instantly
→ After that, sustain 10 requests/second
→ If idle for 10 seconds, bucket refills to 100

A minimal in-memory implementation:

class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate; // tokens per millisecond (10/sec → 0.01)
    this.lastRefill = Date.now();
  }

  consume() {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // Request allowed
    }
    return false; // Rate limited
  }

  refill() {
    const now = Date.now();
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }
}

Sliding Window

The sliding window algorithm counts requests within a rolling time window. Unlike fixed windows (which can allow 2x the limit at window boundaries), sliding windows provide consistent enforcement.

Window: 60 seconds
Limit: 100 requests

At any point in time, the server counts requests
from the last 60 seconds. If count >= 100, reject.

Most production APIs use a weighted sliding window (combining the current and previous window counts) for efficiency — it avoids storing every individual request timestamp.
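
A minimal sketch of that weighted approach (class and method names here are illustrative, not from a specific library): estimate the rolling count as the current window's count plus the previous window's count scaled by how much of the previous window still overlaps the rolling window.

```javascript
// Weighted sliding window: approximate the rolling count from two
// fixed-window counters instead of storing every request timestamp.
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.currentWindow = 0; // start timestamp of the current window
    this.currentCount = 0;
    this.previousCount = 0;
  }

  allow(now = Date.now()) {
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;

    if (windowStart !== this.currentWindow) {
      // Rolled into a new window: the old current count becomes the
      // previous count. If more than one full window passed, it's 0.
      this.previousCount =
        windowStart - this.currentWindow === this.windowMs
          ? this.currentCount
          : 0;
      this.currentCount = 0;
      this.currentWindow = windowStart;
    }

    // Fraction of the previous window still inside the rolling window
    const overlap = 1 - (now - windowStart) / this.windowMs;
    const estimated = this.currentCount + this.previousCount * overlap;

    if (estimated >= this.limit) return false; // rate limited
    this.currentCount += 1;
    return true;
  }
}
```

The estimate assumes requests were evenly spread across the previous window, which is the accuracy trade-off that buys the O(1) memory.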

Fixed Window

The simplest algorithm: reset the counter every N seconds.

Window: 60 seconds
Limit: 100 requests

Request at 59s: allowed (99/100 used)
Request at 61s: allowed (1/100 — new window)

The catch: a burst of 100 requests at second 59 followed by 100 at second 61 means 200 requests in 2 seconds, even though the “limit” is 100 per minute. This boundary problem is why most serious APIs use sliding windows instead.
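
For comparison, a fixed-window counter takes only a few lines, which is why it shows up so often despite the boundary problem. A minimal sketch (names are illustrative):

```javascript
// Fixed window: reset the counter whenever a window boundary is crossed.
class FixedWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.windowStart = 0;
    this.count = 0;
  }

  allow(now = Date.now()) {
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    if (windowStart !== this.windowStart) {
      this.windowStart = windowStart; // new window: reset the counter
      this.count = 0;
    }
    if (this.count >= this.limit) return false; // rate limited
    this.count += 1;
    return true;
  }
}
```

Note how a full burst just before the boundary and another just after it both pass, which is exactly the 2x problem described above.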

Reading Rate Limit Headers

Well-designed APIs tell you exactly where you stand with response headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1700000060
Retry-After: 30

Header                   Meaning
X-RateLimit-Limit        Maximum requests allowed per window
X-RateLimit-Remaining    How many requests you have left
X-RateLimit-Reset        Unix timestamp when the window resets
Retry-After              Seconds to wait before retrying (on 429 responses)

Need to convert timestamps? The X-RateLimit-Reset header is a Unix timestamp. Use our Unix Timestamp Converter to quickly see when your rate limit window resets in human-readable format.
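
As a sketch, those headers can be folded into one object on the client (the function name is mine, and APIs using the newer standardized RateLimit-* header names would need the strings adjusted):

```javascript
// Pull rate limit state out of a fetch() Response using the
// X-RateLimit-* convention shown above.
function parseRateLimitHeaders(response) {
  const limit = Number(response.headers.get('X-RateLimit-Limit'));
  const remaining = Number(response.headers.get('X-RateLimit-Remaining'));
  const resetUnix = Number(response.headers.get('X-RateLimit-Reset'));

  return {
    limit,
    remaining,
    // Convert the Unix timestamp to a Date and to "ms until reset"
    resetAt: new Date(resetUnix * 1000),
    msUntilReset: Math.max(0, resetUnix * 1000 - Date.now()),
  };
}
```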

Handling Rate Limits as an API Consumer

Exponential Backoff with Jitter

When you hit a 429, the worst thing you can do is immediately retry. The second worst thing is retrying after a fixed delay — because every other rate-limited client will also retry at the same moment, creating a “thundering herd.”

The solution is exponential backoff with jitter:

async function fetchWithRetry(url, options = {}, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, options);

    if (response.status !== 429) {
      return response;
    }

    // Check Retry-After header first
    const retryAfter = response.headers.get('Retry-After');
    let delay;

    // Retry-After can also be an HTTP-date; fall back to backoff if not numeric
    if (retryAfter && !Number.isNaN(Number(retryAfter))) {
      delay = Number(retryAfter) * 1000;
    } else {
      // Exponential backoff: 1s, 2s, 4s, 8s, 16s
      const baseDelay = Math.pow(2, attempt) * 1000;
      // Add jitter: random value between 0 and baseDelay
      delay = baseDelay + Math.random() * baseDelay;
    }

    console.log(`Rate limited. Retrying in ${Math.round(delay)}ms...`);
    await new Promise(resolve => setTimeout(resolve, delay));
  }

  throw new Error(`Failed after ${maxRetries} retries`);
}

The jitter is crucial — it spreads retry attempts across time so they don’t all hit the server simultaneously.

Client-Side Rate Limiting

The best strategy is to never hit the limit in the first place. If you know the API allows 100 requests per minute, throttle your client:

class RateLimiter {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.queue = [];
    this.timestamps = [];
  }

  async execute(fn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject });
      this.processQueue();
    });
  }

  processQueue() {
    const now = Date.now();
    // Remove timestamps outside the window
    this.timestamps = this.timestamps.filter(
      t => now - t < this.windowMs
    );

    while (this.queue.length > 0 &&
           this.timestamps.length < this.maxRequests) {
      const { fn, resolve, reject } = this.queue.shift();
      this.timestamps.push(now);
      fn().then(resolve).catch(reject);
    }

    // Schedule next check if queue has items
    if (this.queue.length > 0) {
      const oldestTimestamp = this.timestamps[0];
      const waitTime = this.windowMs - (now - oldestTimestamp);
      setTimeout(() => this.processQueue(), waitTime);
    }
  }
}

// Usage
const limiter = new RateLimiter(100, 60000); // 100 req/min
await limiter.execute(() => fetch('/api/users'));

Implementing Rate Limiting on Your Server

Express.js with Redis

For production APIs, use Redis as the rate limit store — it handles concurrent requests correctly and works across multiple server instances:

const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);

const apiLimiter = rateLimit({
  store: new RedisStore({ sendCommand: (...args) => redis.call(...args) }),
  windowMs: 60 * 1000,    // 1 minute
  max: 100,                // 100 requests per window
  standardHeaders: true,   // Send RateLimit-* headers
  legacyHeaders: false,    // Disable X-RateLimit-* headers
  message: {
    error: 'Too many requests',
    retryAfter: 60,
  },
});

app.use('/api/', apiLimiter);

Per-User vs Per-IP Limiting

IP-based limiting is easy but inaccurate — multiple users behind a corporate NAT share one IP. API key or token-based limiting is more precise:

const userLimiter = rateLimit({
  keyGenerator: (req) => {
    // Use API key if available, fall back to IP
    return req.headers['x-api-key'] || req.ip;
  },
  max: (req) => {
    // Different limits for different tiers
    if (req.user?.plan === 'premium') return 1000;
    if (req.user?.plan === 'basic') return 100;
    return 20; // Anonymous / free tier
  },
  windowMs: 60 * 1000,
});

Common Mistakes

Mistake 1: Retrying Immediately on 429

// WRONG: creates a retry storm
let res = await fetch(url);
if (res.status === 429) {
  res = await fetch(url); // Instantly retries — and gets another 429
}

Always use exponential backoff. And always check the Retry-After header — the server is telling you exactly how long to wait.

Mistake 2: Rate Limiting After Authentication Only

Public endpoints (login, registration, password reset) need rate limiting too — arguably more than authenticated endpoints. Without it, attackers can brute-force credentials or enumerate email addresses.

// Rate limit login attempts aggressively
const loginLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 10,                    // 10 attempts per window
  skipSuccessfulRequests: true, // Don't count successful logins
});

app.post('/auth/login', loginLimiter, loginHandler);

Mistake 3: Not Distinguishing Read vs Write Limits

GET requests are typically cheaper than POST/PUT/DELETE. Apply different limits:

const readLimiter = rateLimit({ windowMs: 60000, max: 200 });
const writeLimiter = rateLimit({ windowMs: 60000, max: 50 });

app.get('/api/*', readLimiter);
app.post('/api/*', writeLimiter);
app.put('/api/*', writeLimiter);
app.delete('/api/*', writeLimiter);

Mistake 4: Ignoring Rate Limit Headers in Your Client

I’ve reviewed codebases where developers parse API responses carefully but completely ignore the rate limit headers. Those headers are free information — use them to preemptively pause requests before hitting the limit.
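
As a sketch of that preemptive approach (the function name and the shared-state shape are my own, assuming the X-RateLimit-* headers shown earlier):

```javascript
// Wait out the window before a request when the budget is nearly gone,
// instead of sending it and eating a 429.
async function politeFetch(url, state = { remaining: Infinity, resetMs: 0 }) {
  if (state.remaining <= 1) {
    // Out of budget: sleep until the reported reset time
    const wait = Math.max(0, state.resetMs - Date.now());
    await new Promise(resolve => setTimeout(resolve, wait));
  }

  const response = await fetch(url);

  // Update the shared state from the response headers for the next call
  const remaining = response.headers.get('X-RateLimit-Remaining');
  const reset = response.headers.get('X-RateLimit-Reset');
  if (remaining !== null) state.remaining = Number(remaining);
  if (reset !== null) state.resetMs = Number(reset) * 1000;

  return response;
}
```

Sharing one `state` object across all callers keeps the whole client under the budget, not just each call site individually.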

Quick Checklist

As an API provider:

  • Implement sliding window or token bucket rate limiting
  • Use Redis for distributed rate limiting across instances
  • Return standard RateLimit-* headers and Retry-After on 429
  • Rate limit public endpoints (login, signup, password reset)
  • Apply different limits for read vs write operations
  • Log rate limit events for monitoring abuse patterns

As an API consumer:

  • Implement exponential backoff with jitter on 429
  • Read and respect Retry-After headers
  • Throttle requests client-side to stay under limits
  • Monitor X-RateLimit-Remaining to preemptively slow down
  • Cache responses to reduce unnecessary API calls


Need to debug API responses? Our JSON Formatter helps you inspect rate limit error payloads, and the Unix Timestamp Converter decodes those X-RateLimit-Reset timestamps into readable dates.

Necmeddin Cunedioglu, Author

Software developer and the creator of UseToolSuite. I write about the tools and techniques I use daily as a developer — practical guides based on real experience, not theory.