13 · Production Resilience 🔴

Production agents need to survive API outages, rate limits, slow responses, and cascading failures. This example shows the patterns that keep agents running when things go wrong.

What you'll learn

Fallback chains (if GPT-4 is down, use Claude)
Retry with exponential backoff
Circuit breakers (stop hammering a failing service)
Timeout guards
Graceful degradation

1 · LLM Fallback Chain

If your primary model is unavailable, automatically fall back to alternatives:

import { createAgent } from 'confused-ai';
import { FallbackChain } from 'confused-ai/model';

// Tries models in order — uses first one that responds
const resilientModel = new FallbackChain([
  { provider: 'openai',    model: 'gpt-4o',               timeout: 30_000 },
  { provider: 'anthropic', model: 'claude-3-5-sonnet-20241022', timeout: 30_000 },
  { provider: 'openai',    model: 'gpt-4o-mini',          timeout: 15_000 }, // cheaper fallback
]);

const agent = createAgent({
  name: 'resilient-agent',
  model: resilientModel,
  instructions: 'You are a helpful assistant.',
});

// If gpt-4o is down → tries Claude → tries gpt-4o-mini
const result = await agent.run('Explain quantum entanglement simply.');
console.log(result.text);
console.log('Used model:', result.modelUsed); // whichever succeeded
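
To make the ordering logic concrete, here is a minimal sketch of the idea behind a fallback chain in plain TypeScript. It is not how `FallbackChain` is implemented; `Candidate` and `firstSuccessful` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical helper (not part of confused-ai): try each candidate in order
// and return the first result that does not throw.
type Candidate<T> = { name: string; call: () => Promise<T> };

async function firstSuccessful<T>(
  candidates: Candidate<T>[],
): Promise<{ value: T; used: string }> {
  const errors: string[] = [];
  for (const { name, call } of candidates) {
    try {
      return { value: await call(), used: name };
    } catch (err) {
      // Remember why this candidate failed, then move on to the next one
      errors.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All candidates failed: ${errors.join('; ')}`);
}
```

Collecting every error message before throwing keeps the final failure debuggable: you can see why each provider was skipped, not only why the last one failed.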

2 · Retry with Backoff

import { createAgent } from 'confused-ai';

const agent = createAgent({
  model: 'gpt-4o-mini',
  retry: {
    maxAttempts: 3,
    initialDelayMs: 1_000,
    backoffMultiplier: 2,   // each delay doubles: 1s, 2s, 4s, ...
    retryOn: [              // only retry these errors
      'rate_limit',
      'server_error',
      'timeout',
    ],
  },
});

Manual retry wrapper for tools:

async function withRetry<T>(
  fn: () => Promise<T>,
  { maxAttempts = 3, delayMs = 1000 } = {}
): Promise<T> {
  let lastErr: Error = new Error('unknown');
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err instanceof Error ? err : new Error(String(err));
      if (attempt < maxAttempts) {
        const backoff = delayMs * Math.pow(2, attempt - 1);
        console.warn(`Attempt ${attempt} failed, retrying in ${backoff}ms...`);
        await new Promise(r => setTimeout(r, backoff));
      }
    }
  }
  throw lastErr;
}

// Usage
const result = await withRetry(() => agent.run('Summarize this article: ...'));
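
When many clients retry on the same schedule, their retries can arrive in synchronized waves. A common mitigation is to cap the delay and add random jitter. The sketch below is one way to compute such a delay; `backoffDelay` and its option names are assumptions made for this example, not part of confused-ai.

```typescript
// Hypothetical helper: delay in ms before retry number `attempt` (1-based).
// Uses "full jitter": a uniform random value between 0 and the capped
// exponential delay.
function backoffDelay(
  attempt: number,
  { baseMs = 1_000, factor = 2, maxMs = 30_000, random = Math.random } = {},
): number {
  // Exponential growth, capped so late attempts do not wait unboundedly
  const ceiling = Math.min(maxMs, baseMs * Math.pow(factor, attempt - 1));
  return Math.floor(random() * ceiling);
}
```

In `withRetry`, the fixed `backoff` value could be swapped for `backoffDelay(attempt)`. The `random` option exists only so the function can be tested deterministically.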

3 · Circuit Breaker

After N failures, the circuit "opens" — all requests fail fast for a cooldown period instead of continuing to hammer a broken service.

// npm install confused-ai (includes circuit breaker)
import { CircuitBreaker } from 'confused-ai';

const breaker = new CircuitBreaker({
  name: 'openai-api',
  threshold: 5,          // open after 5 failures
  timeout: 60_000,       // stay open for 60s
  halfOpenRequests: 2,   // allow 2 test requests when half-open
});

// Wrap any async operation
const result = await breaker.execute(async () => {
  return await agent.run('Hello');
});

// Monitor state
console.log(breaker.state);  // 'closed' | 'open' | 'half-open'
console.log(breaker.stats);  // { failures, successes, rejections }
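
For intuition about the three states, here is a minimal hand-rolled breaker. It is a simplified sketch, not the library's `CircuitBreaker`: `SimpleBreaker` is a hypothetical class, it counts consecutive failures, and it does not limit how many concurrent calls pass through while half-open.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Hypothetical minimal breaker, for illustration only.
class SimpleBreaker {
  state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error('circuit open: failing fast');
      }
      this.state = 'half-open'; // cooldown elapsed: allow a trial call
    }
    try {
      const value = await fn();
      this.failures = 0;
      this.state = 'closed'; // any success closes the circuit
      return value;
    } catch (err) {
      this.failures++;
      // A failed trial call, or too many consecutive failures, opens the circuit
      if (this.state === 'half-open' || this.failures >= this.threshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

The injectable `now` clock is a testing convenience; in normal use the default `Date.now` applies.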

4 · Timeout Guard

Never let an agent hang forever:

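function withTimeout<T>(promise: Promise<T>, ms: number, label = 'operation'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer once the race settles so it cannot keep the process alive.
  // Note: this stops waiting for `promise`; it does not cancel the underlying work.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}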

// Agent call with 30-second hard timeout
const result = await withTimeout(
  agent.run('Research and summarize the latest AI news'),
  30_000,
  'agent.run'
);

Or configure it directly:

const agent = createAgent({
  model: 'gpt-4o-mini',
  timeoutMs: 30_000,       // agent-level timeout
  toolTimeoutMs: 10_000,   // per-tool timeout
});
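
A `Promise.race` timeout only stops waiting; the underlying request keeps running. If the operation accepts an `AbortSignal` (as `fetch` does), it can be cancelled as well. The sketch below assumes such an operation; `withAbortTimeout` is a hypothetical helper, and whether `agent.run` accepts a signal is not something this guide establishes.

```typescript
// Hypothetical helper: runs `task` with an AbortSignal that fires after `ms`.
// The task must honor the signal for the timeout to take effect.
async function withAbortTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer); // always release the timer, on success or failure
  }
}
```

For example, `withAbortTimeout((signal) => fetch(url, { signal }), 5_000)` aborts the HTTP request itself after five seconds, where `url` is whatever endpoint your tool calls.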

5 · Graceful Degradation

When the agent fails, fall back to a static response:

async function safeAgentRun(message: string, userId: string) {
  try {
    const result = await withTimeout(
      agent.run(message, { userId }),
      20_000
    );
    return { text: result.text, source: 'agent' };
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    
    // Log the failure
    console.error('[agent] failed:', error.message, { userId, message });
    
    // Return a graceful fallback
    if (error.message.includes('rate_limit')) {
      return {
        text: "We're experiencing high demand. Please try again in a moment.",
        source: 'fallback',
      };
    }
    return {
      text: "I'm temporarily unavailable. Our team has been notified. Please try again shortly.",
      source: 'fallback',
    };
  }
}

6 · Health Check Endpoint

import { createServer } from 'node:http';

let agentHealthy = true;
let lastSuccessAt = Date.now();

// Probe the agent every 30s
setInterval(async () => {
  try {
    await withTimeout(agent.run('ping'), 5_000);
    agentHealthy = true;
    lastSuccessAt = Date.now();
  } catch {
    agentHealthy = false;
    console.error('[health] agent probe failed');
  }
}, 30_000);

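const server = createServer((req, res) => {
  if (req.url === '/health') {
    const status = agentHealthy ? 200 : 503;
    res.writeHead(status, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({
      status: agentHealthy ? 'ok' : 'degraded',
      lastSuccessAt: new Date(lastSuccessAt).toISOString(),
    }));
    return;
  }
  // Answer every other path so those requests do not hang
  res.writeHead(404).end();
});

// The port is an example; use whatever your deployment expects
server.listen(3000);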

7 · Full ResilientAgent Pattern

Combine the fallback chain, retries, timeouts, and error hooks in a single agent configuration:

import { createAgent } from 'confused-ai';
import { FallbackChain } from 'confused-ai/model';

const agent = createAgent({
  name: 'production-agent',
  model: new FallbackChain([
    { provider: 'openai',    model: 'gpt-4o',       timeout: 25_000 },
    { provider: 'anthropic', model: 'claude-3-5-sonnet-20241022', timeout: 25_000 },
    { provider: 'openai',    model: 'gpt-4o-mini',  timeout: 15_000 },
  ]),
  retry: { maxAttempts: 2, backoffMultiplier: 2 },
  timeoutMs: 45_000,
  toolTimeoutMs: 10_000,
  hooks: {
    onError: (err, ctx) => {
      // Send to your alerting system (`alerting` stands in for your own client)
      void alerting.notify({ error: err.message, userId: ctx.userId });
    },
  },
});

What's next?

14 · MCP Filesystem Agent — Model Context Protocol tools
15 · Full-Stack App — put it all together
