Implementing Exponential Backoff for Failed Chunks
Retry a failed chunk with full-jitter exponential backoff, cap the attempts, honor any Retry-After header, and make the PUT idempotent so a retry never duplicates or corrupts the upload.
Transient failures (a dropped socket, a 503 during a deploy, a rate-limit 429) should not kill a multi-gigabyte transfer. Disciplined retry is central to the upload error recovery patterns covered under frontend UX, chunking and progress tracking. The trap is naive retry: fixed delays cause thundering herds, unbounded retries hang the UI, and non-idempotent writes double-append data. This guide implements full-jitter backoff with a ceiling, server-directed delays, and idempotent chunk PUTs, building on the foundations in browser timeout and retry logic.
When to use this approach
- You upload chunks over HTTP and need to survive intermittent 429, 503, and network errors without restarting the whole file.
- You want retries that spread out under load (jitter) rather than synchronizing into a self-inflicted spike.
- Your chunk endpoint can be made idempotent (a deterministic key or offset), so a retried PUT is safe.
Prerequisites
- A chunk endpoint that accepts PUT with a stable per-chunk key (offset or index).
- The server returns Retry-After on 429/503 when it wants to pace you.
- fetch with AbortSignal.timeout (Node 20+ or any evergreen browser).
Implementation
putChunkWithBackoff retries a single chunk. It classifies the failure, computes a full-jitter delay capped at a maximum, prefers the server’s Retry-After when present, stops after maxRetries, and sends an idempotency key so the server can dedupe.
export interface ChunkInput {
  uploadId: string;
  index: number;
  offset: number;
  blob: Blob;
  url: string;
}

interface BackoffOptions {
  maxRetries: number;
  baseMs: number;
  capMs: number;
  perAttemptTimeoutMs: number;
}

const DEFAULTS: BackoffOptions = {
  maxRetries: 6,
  baseMs: 500,
  capMs: 30_000,
  perAttemptTimeoutMs: 45_000,
};

const RETRIABLE_STATUS = new Set([408, 425, 429, 500, 502, 503, 504]);

function sleep(ms: number): Promise<void> {
  return new Promise((r) => setTimeout(r, ms));
}

/** Full-jitter backoff: random between 0 and min(cap, base * 2^attempt). */
function jitteredDelay(attempt: number, opts: BackoffOptions): number {
  const exp = Math.min(opts.capMs, opts.baseMs * 2 ** attempt);
  return Math.random() * exp;
}

function parseRetryAfter(header: string | null): number | null {
  if (!header) return null;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000; // delta-seconds form
  const date = Date.parse(header); // HTTP-date form
  return Number.isNaN(date) ? null : Math.max(0, date - Date.now());
}
export async function putChunkWithBackoff(
  chunk: ChunkInput,
  options: Partial<BackoffOptions> = {},
): Promise<void> {
  const opts = { ...DEFAULTS, ...options };
  // Stable key => the server treats a retried PUT as the same write.
  const idempotencyKey = `${chunk.uploadId}:${chunk.index}`;

  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      const res = await fetch(chunk.url, {
        method: "PUT",
        headers: {
          "Content-Type": "application/octet-stream",
          "Idempotency-Key": idempotencyKey,
          "Content-Range": `bytes ${chunk.offset}-${chunk.offset + chunk.blob.size - 1}/*`,
        },
        body: chunk.blob,
        signal: AbortSignal.timeout(opts.perAttemptTimeoutMs),
      });
      if (res.ok) return; // 2xx: chunk stored
      if (!RETRIABLE_STATUS.has(res.status)) {
        throw new Error(`Permanent failure on chunk ${chunk.index}: HTTP ${res.status}`);
      }
      if (attempt === opts.maxRetries) {
        throw new Error(`Chunk ${chunk.index} failed after ${opts.maxRetries} retries`);
      }
      // Prefer the server's pacing; otherwise back off with jitter.
      const serverDelay = parseRetryAfter(res.headers.get("Retry-After"));
      const delay = serverDelay ?? jitteredDelay(attempt, opts);
      console.warn(`[chunk ${chunk.index}] HTTP ${res.status}, retry in ${Math.round(delay)}ms`);
      await sleep(delay);
    } catch (err) {
      const e = err as Error;
      // fetch rejects with a TypeError on network failure, and the message text
      // differs per runtime ("Failed to fetch", "fetch failed", ...), so match on
      // the error name rather than on a "network" substring.
      const transient =
        e.name === "TimeoutError" || e.name === "AbortError" || e.name === "TypeError";
      if (!transient || attempt === opts.maxRetries) throw e;
      const delay = jitteredDelay(attempt, opts);
      console.warn(`[chunk ${chunk.index}] ${e.name}, retry in ${Math.round(delay)}ms`);
      await sleep(delay);
    }
  }
}
Line-by-line of the critical parts
- RETRIABLE_STATUS lists only transient codes. A 400 or 403 is permanent: retrying wastes time and hides a real bug, so it throws immediately.
- jitteredDelay is full jitter: Math.random() * min(cap, base * 2^attempt). The exponential term grows the ceiling; the random factor desynchronizes clients. Compared with fixed backoff, and generally with “equal jitter” too, it spreads retries over a wider window and is less prone to retry storms.
- capMs stops the exponential from exploding. Without it, the window at attempt 10 would already be about 8.5 minutes (500 ms * 2^10) and would keep doubling. Thirty seconds is a sane ceiling for interactive uploads.
- parseRetryAfter handles both Retry-After forms: an integer delta-seconds and an HTTP-date. When the server tells you when to come back, that always wins over your computed delay.
- Idempotency-Key: uploadId:index is the linchpin of safe retries. If a chunk actually succeeded but the response was lost, the retry carries the same key and the server returns the prior result instead of writing twice. A Content-Range offset gives the same guarantee for range-addressed stores.
- AbortSignal.timeout(perAttemptTimeoutMs) bounds each attempt so a hung socket counts as a failure and triggers backoff rather than blocking forever.
- The catch branch retries only genuine transient errors (TimeoutError, AbortError, network failures). Any other thrown error (including the permanent-status throw above) propagates out.
The timeline shows how the delay window widens per attempt while jitter scatters the actual fire times.
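The same widening can be read off numerically. The sketch below lists the upper bound of the full-jitter window for each attempt, assuming the default 500 ms base and 30 s cap; backoffWindows is a hypothetical helper introduced only for illustration, not part of the implementation above.

```typescript
/** Upper bound of the full-jitter window for each attempt (hypothetical helper). */
function backoffWindows(baseMs: number, capMs: number, attempts: number): number[] {
  const windows: number[] = [];
  for (let attempt = 0; attempt < attempts; attempt++) {
    // Same formula as jitteredDelay, minus the random factor.
    windows.push(Math.min(capMs, baseMs * 2 ** attempt));
  }
  return windows;
}

// Assumed defaults: base 500 ms, cap 30 s.
console.log(backoffWindows(500, 30_000, 8));
// [500, 1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

Each actual delay is drawn uniformly between 0 and the listed bound, so with these values the cap first takes effect at attempt 6.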
Configuration gotchas
Retried chunks duplicate data. Without an Idempotency-Key (or offset-addressed PUT), a chunk that succeeded but whose response was lost gets written twice on retry, corrupting the assembled file. Always send a stable key derived from uploadId and chunk index.
429 Too Many Requests ignored, ban escalates. Computing your own delay while the server sent Retry-After: 120 makes you retry too early and earn a longer block. Parse and honor Retry-After before falling back to jitter.
Error: signal timed out on every attempt. perAttemptTimeoutMs is shorter than the time to upload one chunk on a slow link. Size the timeout to chunkSize / minExpectedBandwidth plus headroom, or shrink the chunk.
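One way to apply the sizing rule is a small helper like the one below. It is a sketch under stated assumptions: timeoutForChunk, its 1.5x default headroom, and the 256 KiB/s floor in the example are illustrative choices, not values from the implementation above.

```typescript
/**
 * Hypothetical helper: per-attempt timeout in ms for a chunk of `chunkBytes`
 * on a link assumed to sustain at least `minBytesPerSec`, padded by `headroom`.
 */
function timeoutForChunk(chunkBytes: number, minBytesPerSec: number, headroom = 1.5): number {
  if (chunkBytes <= 0 || minBytesPerSec <= 0 || headroom < 1) {
    throw new RangeError("chunkBytes and minBytesPerSec must be positive; headroom must be >= 1");
  }
  const transferMs = (chunkBytes / minBytesPerSec) * 1000;
  return Math.ceil(transferMs * headroom);
}

// A 5 MiB chunk on a link assumed to sustain 256 KiB/s:
const timeoutMs = timeoutForChunk(5 * 1024 * 1024, 256 * 1024);
console.log(timeoutMs); // 30000 (20 s transfer time x 1.5 headroom)
```

The result can be passed as perAttemptTimeoutMs. If it comes out uncomfortably long for an interactive upload, that is a signal to shrink the chunk rather than the timeout.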
Retry storm after a server blip. Fixed or equal delays make all clients retry at the same instant. Full jitter (Math.random() * window) is what spreads them out — do not “improve” it into a fixed delay.
Verification
Stub fetch to return 503 twice (each with Retry-After: 0) and then succeed, and assert that the chunk is stored on the third attempt. This exercises the retry loop and the Retry-After path:
// Deterministic test: stub fetch to fail twice with 503 then succeed.
let calls = 0;
const original = globalThis.fetch;
globalThis.fetch = async () => {
  calls++;
  if (calls <= 2) {
    return new Response(null, { status: 503, headers: { "Retry-After": "0" } });
  }
  return new Response(null, { status: 200 });
};

const blob = new Blob([new Uint8Array(1024)]);
await putChunkWithBackoff(
  { uploadId: "u1", index: 0, offset: 0, blob, url: "https://x/c" },
  { baseMs: 1, capMs: 5, perAttemptTimeoutMs: 1000 },
);

console.assert(calls === 3, `expected 3 attempts, got ${calls}`);
globalThis.fetch = original;
console.log("backoff honored Retry-After and succeeded on attempt 3");
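The bound on the jittered delay can be checked separately. The sketch below uses a local copy of the full-jitter formula rather than the module's private jitteredDelay; the injectable rand parameter is an assumption added so a test can pin the random factor.

```typescript
// Local copy of the full-jitter formula, with the random source injectable
// so a test can pin it (the `rand` parameter is an assumption, not the guide's API).
function fullJitter(
  attempt: number,
  baseMs: number,
  capMs: number,
  rand: () => number = Math.random,
): number {
  return rand() * Math.min(capMs, baseMs * 2 ** attempt);
}

const baseMs = 500;
const capMs = 30_000;

// Every sampled delay must fall inside [0, ceiling) for its attempt.
for (let attempt = 0; attempt < 12; attempt++) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  for (let i = 0; i < 1_000; i++) {
    const d = fullJitter(attempt, baseMs, capMs);
    console.assert(d >= 0 && d < ceiling, `attempt ${attempt}: ${d} outside [0, ${ceiling})`);
  }
}

// Pinning the random source makes the window itself checkable.
console.assert(fullJitter(3, baseMs, capMs, () => 0.5) === 2000, "midpoint of the 4000 ms window");
console.assert(fullJitter(20, baseMs, capMs, () => 0.999) <= capMs, "never exceeds the cap");
```

Because the ceiling is non-decreasing in the attempt number and the draw is uniform, the expected delay (half the ceiling) is non-decreasing as well.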
FAQ
Why full jitter instead of plain exponential backoff?
Plain exponential backoff keeps every client’s retries aligned to the same grid, so they hammer the recovering server in waves. Full jitter randomizes each delay across the whole window, flattening the aggregate load — it is the variant AWS recommends for distributed retries.
Should Retry-After always win over my computed delay?
Yes. The server knows its own recovery timeline; honoring Retry-After avoids premature retries that extend rate-limit blocks. Fall back to jittered backoff only when the header is absent.
How many retries is reasonable?
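Six retries (seven attempts in total) with a 500 ms base add at most about 31.5 seconds of backoff, and about 16 seconds on average, on top of the time spent in the attempts themselves. The 30-second cap only comes into play if you raise maxRetries or baseMs. That covers most transient incidents without freezing the UI indefinitely. Beyond that, surface the failure and let the user (or a queue) decide; see resuming uploads after network loss.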
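To gauge a retry budget for other settings, the backoff windows can be summed. This is a rough sketch: backoffBudget is a hypothetical helper, and it counts only sleep time, ignoring the duration of each attempt and any server-supplied Retry-After.

```typescript
/** Worst-case and expected total backoff across `maxRetries` full-jitter delays (hypothetical helper). */
function backoffBudget(
  maxRetries: number,
  baseMs: number,
  capMs: number,
): { worstMs: number; expectedMs: number } {
  let worstMs = 0;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    worstMs += Math.min(capMs, baseMs * 2 ** attempt);
  }
  // A uniform draw on [0, window) averages half the window.
  return { worstMs, expectedMs: worstMs / 2 };
}

console.log(backoffBudget(6, 500, 30_000)); // { worstMs: 31500, expectedMs: 15750 }
```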
What makes a chunk PUT idempotent?
A deterministic address: either an Idempotency-Key the server records, or a Content-Range/offset so the storage layer writes the same bytes to the same place. Both let a duplicate retry be a no-op instead of an append.
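On the server side, the idea can be sketched as follows. ChunkStore is a hypothetical in-memory stand-in, assuming the uploadId:index key format used by the client above; a real service would persist both the seen keys and the bytes.

```typescript
// Hypothetical in-memory chunk store; a real server would persist this state.
class ChunkStore {
  private readonly seenKeys = new Set<string>();
  private readonly chunks = new Map<number, Uint8Array>(); // keyed by byte offset

  /** Returns "stored" for a first write and "duplicate" for a replayed key. */
  put(idempotencyKey: string, offset: number, bytes: Uint8Array): "stored" | "duplicate" {
    if (this.seenKeys.has(idempotencyKey)) return "duplicate"; // replay: no second write
    this.chunks.set(offset, bytes); // offset-addressed, so never an append
    this.seenKeys.add(idempotencyKey);
    return "stored";
  }

  /** Total bytes held across all stored chunks. */
  totalBytes(): number {
    let total = 0;
    this.chunks.forEach((c) => {
      total += c.length;
    });
    return total;
  }
}

const store = new ChunkStore();
const data = new Uint8Array(1024);
console.log(store.put("u1:0", 0, data)); // "stored"
console.log(store.put("u1:0", 0, data)); // "duplicate" (retry after a lost response)
console.log(store.totalBytes()); // 1024, not 2048
```

Keying the chunk map by offset gives a second layer of protection: even if the key check were skipped, a replayed write would overwrite the same range instead of appending.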