Implementing Exponential Backoff for Failed Chunks

Retry a failed chunk with full-jitter exponential backoff, cap the attempts, honor any Retry-After header, and make the PUT idempotent so a retry never duplicates or corrupts the upload.

Transient failures (a dropped socket, a 503 during a deploy, a rate-limit 429) should not kill a multi-gigabyte transfer. Disciplined retry is the heart of upload error recovery patterns, which belong to the wider topic of frontend UX, chunking and progress tracking. The trap is naive retry: fixed delays cause thundering herds, unbounded retries hang the UI, and non-idempotent writes double-append data. This guide implements full-jitter backoff with a ceiling, server-directed delays, and idempotent chunk PUTs, complementing the foundations in browser timeout and retry logic.

When to use this approach

  • You upload chunks over HTTP and need to survive intermittent 429, 503, and network errors without restarting the whole file.
  • You want retries that spread out under load (jitter) rather than synchronizing into a self-inflicted spike.
  • Your chunk endpoint can be made idempotent (a deterministic key or offset), so a retried PUT is safe.

Prerequisites

  1. A chunk endpoint that accepts PUT with a stable per-chunk key (offset or index).
  2. The server returns Retry-After on 429/503 when it wants to pace you.
  3. fetch with AbortSignal.timeout (Node 20+ or any evergreen browser).

Implementation

putChunkWithBackoff retries a single chunk. It classifies the failure, computes a full-jitter delay capped at a maximum, prefers the server’s Retry-After when present, stops after maxRetries, and sends an idempotency key so the server can dedupe.

export interface ChunkInput {
  uploadId: string;
  index: number;
  offset: number;
  blob: Blob;
  url: string;
}

interface BackoffOptions {
  maxRetries: number;
  baseMs: number;
  capMs: number;
  perAttemptTimeoutMs: number;
}

const DEFAULTS: BackoffOptions = {
  maxRetries: 6,
  baseMs: 500,
  capMs: 30_000,
  perAttemptTimeoutMs: 45_000,
};

const RETRIABLE_STATUS = new Set([408, 425, 429, 500, 502, 503, 504]);

function sleep(ms: number): Promise<void> {
  return new Promise((r) => setTimeout(r, ms));
}

/** Full-jitter backoff: random between 0 and min(cap, base * 2^attempt). */
function jitteredDelay(attempt: number, opts: BackoffOptions): number {
  const exp = Math.min(opts.capMs, opts.baseMs * 2 ** attempt);
  return Math.random() * exp;
}

function parseRetryAfter(header: string | null): number | null {
  if (!header) return null;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000; // delta-seconds form
  const date = Date.parse(header); // HTTP-date form
  return Number.isNaN(date) ? null : Math.max(0, date - Date.now());
}

export async function putChunkWithBackoff(
  chunk: ChunkInput,
  options: Partial<BackoffOptions> = {},
): Promise<void> {
  const opts = { ...DEFAULTS, ...options };
  // Stable key => the server treats a retried PUT as the same write.
  const idempotencyKey = `${chunk.uploadId}:${chunk.index}`;

  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      const res = await fetch(chunk.url, {
        method: "PUT",
        headers: {
          "Content-Type": "application/octet-stream",
          "Idempotency-Key": idempotencyKey,
          "Content-Range": `bytes ${chunk.offset}-${chunk.offset + chunk.blob.size - 1}/*`,
        },
        body: chunk.blob,
        signal: AbortSignal.timeout(opts.perAttemptTimeoutMs),
      });

      if (res.ok) return; // 2xx — chunk stored

      if (!RETRIABLE_STATUS.has(res.status)) {
        throw new Error(`Permanent failure on chunk ${chunk.index}: HTTP ${res.status}`);
      }
      if (attempt === opts.maxRetries) {
        throw new Error(`Chunk ${chunk.index} failed after ${opts.maxRetries} retries`);
      }

      // Prefer the server's pacing; otherwise back off with jitter.
      const serverDelay = parseRetryAfter(res.headers.get("Retry-After"));
      const delay = serverDelay ?? jitteredDelay(attempt, opts);
      console.warn(`[chunk ${chunk.index}] HTTP ${res.status}, retry in ${Math.round(delay)}ms`);
      await sleep(delay);
    } catch (err) {
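      const e = err as Error;
      // fetch rejects with a TypeError on network-level failures. The message text
      // differs per runtime ("Failed to fetch", "Load failed", "fetch failed"),
      // so match on the error type rather than on a substring of the message.
      const transient = e.name === "TimeoutError" || e.name === "AbortError" || e instanceof TypeError;
      if (!transient || attempt === opts.maxRetries) throw e;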
      const delay = jitteredDelay(attempt, opts);
      console.warn(`[chunk ${chunk.index}] ${e.name}, retry in ${Math.round(delay)}ms`);
      await sleep(delay);
    }
  }
}

Line-by-line of the critical parts

  • RETRIABLE_STATUS lists only transient codes. A 400 or 403 is permanent — retrying wastes time and hides a real bug, so it throws immediately.
  • jitteredDelay is full jitter: Math.random() * min(cap, base * 2^attempt). The exponential term grows the ceiling; the random factor desynchronizes clients. Unlike fixed backoff it avoids synchronized retry storms, and unlike “equal jitter” it spreads retries across the whole window rather than only its upper half.
  • capMs stops the exponential from exploding: without it, the window at attempt 10 would be 500 ms × 2^10 = 512 s, about 8.5 minutes. Thirty seconds is a sane ceiling for interactive uploads.
  • parseRetryAfter handles both Retry-After forms: an integer delta-seconds and an HTTP-date. When the server tells you when to come back, that always wins over your computed delay.
  • Idempotency-Key: uploadId:index is the linchpin of safe retries. If a chunk actually succeeded but the response was lost, the retry carries the same key and the server returns the prior result instead of writing twice. A Content-Range offset gives the same guarantee for range-addressed stores.
  • AbortSignal.timeout(perAttemptTimeoutMs) bounds each attempt so a hung socket counts as a failure and triggers backoff rather than blocking forever.
  • The catch branch retries only genuine transient errors (timeouts, aborts, and network-level fetch failures). Any other thrown error (including the permanent-status throw above) propagates out.
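
To make the effect of baseMs, capMs, and maxRetries concrete, the windows can be tabulated directly. This is a minimal sketch, not part of putChunkWithBackoff: backoffWindows is a hypothetical helper introduced only for illustration, and the numbers assume the DEFAULTS shown above.

```typescript
// Hypothetical helper (an assumption, not part of the article's code):
// lists the full-jitter ceiling in milliseconds for each retry.
function backoffWindows(baseMs: number, capMs: number, retries: number): number[] {
  const windows: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    windows.push(Math.min(capMs, baseMs * 2 ** attempt));
  }
  return windows;
}

const windows = backoffWindows(500, 30_000, 6);
const worstCaseMs = windows.reduce((sum, w) => sum + w, 0);
const expectedMs = worstCaseMs / 2; // the mean of uniform(0, w) is w / 2

console.log(windows);       // [500, 1000, 2000, 4000, 8000, 16000]
console.log(worstCaseMs);   // 31500
console.log(expectedMs);    // 15750
console.log(500 * 2 ** 10); // 512000: uncapped window at attempt 10, about 8.5 minutes
```

With six retries the window never reaches the 30-second cap; the cap only starts to matter from attempt 6 onward (500 ms × 2^6 = 32 s).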

The timeline shows how the delay window widens per attempt while jitter scatters the actual fire times.

[Figure: Full-jitter exponential backoff timeline. Each retry attempt picks a random delay within a window that doubles (0.5 s, 1 s, 2 s, ...) up to a maximum, with Retry-After overriding when present: delay = random(0, min(cap, base × 2^n)).]
The retry window doubles each attempt; the actual delay is a random point inside it.

Configuration gotchas

Retried chunks duplicate data. Without an Idempotency-Key (or offset-addressed PUT), a chunk that succeeded but whose response was lost gets written twice on retry, corrupting the assembled file. Always send a stable key derived from uploadId and chunk index.

429 Too Many Requests ignored, ban escalates. Computing your own delay while the server sent Retry-After: 120 makes you retry too early and earn a longer block. Parse and honor Retry-After before falling back to jitter.
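
The two Retry-After forms can be checked in isolation. The sketch below repeats the logic of parseRetryAfter with one assumption added for testability: a hypothetical nowMs parameter replaces Date.now() so the HTTP-date branch gives a repeatable result. The name retryAfterToMs is illustrative only.

```typescript
// Same logic as parseRetryAfter, with a hypothetical `nowMs` parameter
// (an assumption for this sketch) so the HTTP-date branch is deterministic.
function retryAfterToMs(header: string | null, nowMs: number): number | null {
  if (!header) return null;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000; // delta-seconds form
  const date = Date.parse(header); // HTTP-date form
  return Number.isNaN(date) ? null : Math.max(0, date - nowMs);
}

const now = Date.parse("Wed, 21 Oct 2015 07:28:00 GMT");

console.log(retryAfterToMs("120", now));                           // 120000
console.log(retryAfterToMs("Wed, 21 Oct 2015 07:29:00 GMT", now)); // 60000
console.log(retryAfterToMs("Wed, 21 Oct 2015 07:27:00 GMT", now)); // 0 (date already passed)
console.log(retryAfterToMs("soon", now));                          // null (unparseable)

// The server's value wins; the jittered delay is only a fallback.
const jitteredFallbackMs = 750; // stand-in for a jitteredDelay() result
const chosenMs = retryAfterToMs("120", now) ?? jitteredFallbackMs;
console.log(chosenMs); // 120000
```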

Error: signal timed out on every attempt. perAttemptTimeoutMs is shorter than the time to upload one chunk on a slow link. Size the timeout to chunkSize / minExpectedBandwidth plus headroom, or shrink the chunk.
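
The sizing rule is plain arithmetic, so it is worth running the numbers for your own chunk size. A minimal sketch follows; sizeTimeoutMs and the 1.5× headroom factor are assumptions chosen for illustration, not values from the implementation above.

```typescript
// Hypothetical helper: the name and the default 1.5x headroom are assumptions.
function sizeTimeoutMs(chunkBytes: number, minBytesPerSecond: number, headroom = 1.5): number {
  const transferMs = (chunkBytes / minBytesPerSecond) * 1000;
  return Math.ceil(transferMs * headroom);
}

const EIGHT_MIB = 8 * 1024 * 1024;    // 8,388,608 bytes
const ONE_MIB = 1024 * 1024;          // 1,048,576 bytes
const ONE_MBIT_PER_S = 1_000_000 / 8; // 125,000 bytes per second

console.log(sizeTimeoutMs(EIGHT_MIB, ONE_MBIT_PER_S)); // 100664
console.log(sizeTimeoutMs(ONE_MIB, ONE_MBIT_PER_S));   // 12583
```

On a 1 Mbit/s link an 8 MiB chunk needs roughly 100 seconds, well past the 45-second default, so every attempt would time out. A 1 MiB chunk fits comfortably.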

Retry storm after a server blip. Fixed or equal delays make all clients retry at the same instant. Full jitter (Math.random() * window) is what spreads them out — do not “improve” it into a fixed delay.

Verification

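Check that retriable responses are retried and that the upload succeeds once the server recovers. The test below stubs fetch to return 503 with Retry-After: 0 twice, then 200, and asserts that exactly three attempts were made: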

// Deterministic test: stub fetch to fail twice with 503 then succeed.
let calls = 0;
const original = globalThis.fetch;
globalThis.fetch = async () => {
  calls++;
  if (calls <= 2) {
    return new Response(null, { status: 503, headers: { "Retry-After": "0" } });
  }
  return new Response(null, { status: 200 });
};

const blob = new Blob([new Uint8Array(1024)]);
try {
  await putChunkWithBackoff(
    { uploadId: "u1", index: 0, offset: 0, blob, url: "https://x/c" },
    { baseMs: 1, capMs: 5, perAttemptTimeoutMs: 1000 },
  );
} finally {
  globalThis.fetch = original; // restore the real fetch even if the upload throws
}
if (calls !== 3) throw new Error(`expected 3 attempts, got ${calls}`);
console.log("backoff honored Retry-After and succeeded on attempt 3");

FAQ

Why full jitter instead of plain exponential backoff?

Plain exponential backoff keeps every client’s retries aligned to the same grid, so they hammer the recovering server in waves. Full jitter randomizes each delay across the whole window, flattening the aggregate load — it is the variant AWS recommends for distributed retries.
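
The flattening effect is easy to see in a small simulation. The sketch below is illustrative only: it assumes 1000 clients that all failed at the same instant and are on attempt 2 (a 2-second window with the defaults above), and it counts how many retries land in the busiest 100 ms bucket. The helper peakPerBucket is hypothetical.

```typescript
// Hypothetical helper: returns the number of retries in the busiest time bucket.
function peakPerBucket(delaysMs: number[], bucketMs = 100): number {
  const counts = new Map<number, number>();
  let peak = 0;
  for (const d of delaysMs) {
    const bucket = Math.floor(d / bucketMs);
    const n = (counts.get(bucket) ?? 0) + 1;
    counts.set(bucket, n);
    peak = Math.max(peak, n);
  }
  return peak;
}

const CLIENTS = 1000;
const windowMs = Math.min(30_000, 500 * 2 ** 2); // attempt 2 => 2000 ms

// Plain exponential: every client waits the full window.
const plain = Array.from({ length: CLIENTS }, () => windowMs);
// Full jitter: every client picks a random point inside the window.
const fullJitter = Array.from({ length: CLIENTS }, () => Math.random() * windowMs);

const plainPeak = peakPerBucket(plain);
const jitterPeak = peakPerBucket(fullJitter);

console.log(plainPeak);  // 1000: all clients fire in the same bucket
console.log(jitterPeak); // varies per run; the average bucket holds 1000 / 20 = 50
```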

Should Retry-After always win over my computed delay?

Yes. The server knows its own recovery timeline; honoring Retry-After avoids premature retries that extend rate-limit blocks. Fall back to jittered backoff only when the header is absent.

How many retries is reasonable?

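Six retries from a 500 ms base add at most 31.5 seconds of backoff in total (0.5 + 1 + 2 + 4 + 8 + 16 s, and about half that on average), not counting the time spent in the attempts themselves. The 30-second cap only comes into play if you raise maxRetries. That covers most transient incidents without freezing the UI indefinitely. Beyond that, surface the failure and let the user (or a queue) decide; see resuming uploads after network loss.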

What makes a chunk PUT idempotent?

A deterministic address: either an Idempotency-Key the server records, or a Content-Range/offset so the storage layer writes the same bytes to the same place. Both let a duplicate retry be a no-op instead of an append.
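
On the server side, the key-recording variant can be sketched as follows. This is an assumption-laden illustration, not the article's backend: ChunkStore is a hypothetical in-memory class, and a real service would persist the keys and their results.

```typescript
// Hypothetical in-memory store (an assumption for illustration only).
interface StoredChunk {
  offset: number;
  bytes: Uint8Array;
}

class ChunkStore {
  private readonly seen = new Map<string, StoredChunk>();

  /** Returns "stored" on the first write and "duplicate" when the key is already recorded. */
  put(idempotencyKey: string, offset: number, bytes: Uint8Array): "stored" | "duplicate" {
    if (this.seen.has(idempotencyKey)) return "duplicate"; // retry of a write that already landed
    this.seen.set(idempotencyKey, { offset, bytes });
    return "stored";
  }

  /** Total bytes held, to show that a retry does not grow the upload. */
  totalBytes(): number {
    let total = 0;
    this.seen.forEach((chunk) => {
      total += chunk.bytes.length;
    });
    return total;
  }
}

const store = new ChunkStore();
const payload = new Uint8Array(1024);

console.log(store.put("u1:0", 0, payload)); // "stored"
console.log(store.put("u1:0", 0, payload)); // "duplicate" (response was lost, client retried)
console.log(store.totalBytes());            // 1024
```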