Skip to content

Compression โ€‹

CompressionManager detects when message threads have grown too large and compresses verbose tool outputs into compact, fact-preserving summaries โ€” in-place, without losing the context the task depends on.

ts
import { CompressionManager } from 'confused-ai';

Quick start โ€‹

ts
import { createAgent } from 'confused-ai';
import { CompressionManager } from 'confused-ai';
import { OpenAIProvider } from 'confused-ai';

const provider = new OpenAIProvider({ apiKey: process.env.OPENAI_API_KEY! });

const compression = new CompressionManager({
  // Provide an LLM callable for summarisation
  generate: async (msgs) => {
    const response = await provider.generateText({
      messages: msgs,
      model: 'gpt-4o-mini',
    });
    return response.text;
  },
  compressToolResults: true,
  compressToolResultsLimit: 3,   // compress after 3+ tool messages
  compressTokenLimit: 4096,      // also compress any message > ~4096 tokens
});

const agent = createAgent({
  name: 'research-agent',
  instructions: 'Research topics in depth using multiple searches.',
  model: 'gpt-4o-mini',
  apiKey: process.env.OPENAI_API_KEY!,
  compression,
});

CompressionManager API โ€‹

Constructor options โ€‹

ts
interface CompressionManagerConfig {
  /** LLM callable for summarisation */
  generate: (messages: Array<{ role: string; content: string }>) => Promise<string>;

  /** Whether to compress tool / function call results (default: true) */
  compressToolResults?: boolean;

  /** Minimum number of tool messages before compressing (default: 3) */
  compressToolResultsLimit?: number;

  /**
   * Single-message content token threshold above which compression triggers
   * regardless of message count. Estimated as content.length / 4.
   * Set to 0 to disable. (default: 4096)
   */
  compressTokenLimit?: number;

  /** Override the default compression system prompt */
  prompt?: string;

  debug?: boolean;
}

Methods โ€‹

ts
// Check if the message list needs compression
cm.shouldCompress(messages);

// Compress tool-result messages in-place (sequential)
await cm.compress(messages);

// Compress in parallel (faster for large batches)
await cm.acompress(messages);

What gets compressed โ€‹

  • Tool / function-call result messages where content exceeds compressTokenLimit
  • Any batch of tool-result messages that reaches compressToolResultsLimit

Compressed messages have the original content replaced with a fact-preserving summary. The original role and all other message fields are preserved.


Manual use in hooks โ€‹

You can also trigger compression explicitly in an afterRun hook or before sending to the model:

ts
const agent = createAgent({
  name: 'deep-researcher',
  instructions: '...',
  model: 'gpt-4o-mini',
  apiKey: process.env.OPENAI_API_KEY!,
  hooks: {
    beforeRun: async (input) => {
      if (compression.shouldCompress(input.messages)) {
        await compression.acompress(input.messages);
      }
      return input;
    },
  },
});

Default compression prompt โ€‹

The built-in prompt instructs the model to:

  1. Preserve all key facts, entities, IDs, numbers, names, dates.
  2. Remove filler, pleasantries, repeated boilerplate, and excess whitespace.
  3. Keep the same language as the input.
  4. Output only the compressed content โ€” no preamble.

Override it with the prompt option if your domain has specific compression requirements.



Mastermind Context Compression Suite โ€‹

While CompressionManager handles general summarization, the Mastermind compression pipeline is a production-grade, multi-stage compression engine. It optimizes KV-cache reuse, compresses message formats using specialized parsers, enforces strict token budgets, and stashes original data in an on-demand retrieval store (CCR).

The pipeline executes four stages on every run:

  1. CacheAligner: Stabilizes the prefix of the message history to maximize KV-cache hits.
  2. Content Routing & Crusher Dispatch: Routes message content based on type (JSON, Code, Logs, CSV, XML) to specialized, deterministic parsing algorithms that compress the text without LLM latency.
  3. Group-based Budget Enforcement: Drops conversation groups oldest-first to fit within a strict token budget, while ensuring tool calls and tool results are never orphaned.
  4. Code & Context Reduction (CCR): Replaces original content with compressed annotations, stashing the originals. Re-injects a retrieval tool so the agent can fetch the raw details if needed.
ts
import { Mastermind, OpenAIProvider } from 'confused-ai';

const provider = new OpenAIProvider({ apiKey: process.env.OPENAI_API_KEY! });

const mastermind = new Mastermind({
  contextTokenBudget: 16_000,        // Fit history within 16k tokens
  messageTokenThreshold: 1_500,       // Only compress messages larger than 1.5k tokens
  enableCCR: true,                    // Allow agents to retrieve uncompressed content
  recentMessagesWindow: 4,            // Keep the last 4 messages completely uncompressed
  generate: async (msgs) => {         // Fallback LLM summarizer for prose
    const res = await provider.generateText({
      messages: msgs,
      model: 'gpt-4o-mini',
    });
    return res.text;
  },
});

const agent = createAgent({
  name: 'mastermind-agent',
  instructions: 'Use your tools to solve tasks.',
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY!,
  // Expose the retrieve tool so the agent can recall compressed details
  tools: [mastermind.retrieveTool],
  hooks: {
    beforeRun: async (input) => {
      const { messages } = await mastermind.compress(input.messages);
      // Materialize replaces message contents with their compressed versions
      input.messages = Mastermind.materialize(messages);
      return input;
    },
  },
});

Stage 1: Cache Aligner โ€‹

LLM providers charge less and respond faster when prompts hit their KV-cache. The CacheAligner normalizes whitespaces, matches repetitive formatting, and structures history headers so prefix matches are maximized.

Stage 2: Specialized Crushers โ€‹

Instead of relying purely on expensive LLMs to summarize structural data, Mastermind inspects the text and routes it to optimized local parsers:

  • JSON Crusher (smart-crusher): Strips empty properties, normalizes indentation, and collapses deeply nested schemas.
  • Code Compressor: Minifies JS/TS, python, and other codeblocks by removing comments, redundant blank lines, and compressing indentation.
  • Log Crusher: Aggregates duplicate trace lines, strips timestamps, and retains only unique stack traces or warning/error contexts.
  • CSV / XML Crushers: Retains headers while truncating or downsampling datasets.
  • Prose Summarizer (summary-llm): Falls back to an LLM summary ONLY when unstructured markdown/prose is detected.

Stage 3: Sliding-Window Group Budget Enforcement โ€‹

When history exceeds the budget, Mastermind drops the oldest messages. However, standard truncation often separates a tool call from its tool result, breaking the ReAct loop. Mastermind groups assistant tool calls and their subsequent tool results into atomic blocks that are dropped together, ensuring the conversation tree remains valid.

Stage 4: Code & Context Reduction (CCR) โ€‹

For highly detailed inputs, compression can lose crucial bits. Under CCR:

  1. Mastermind compresses the message and stashes the raw string in an in-memory CCRStore.
  2. The message is annotated with a handle: e.g., [CCR_REF: ccr-89f41] (Compressed JSON).
  3. If the agent notices this reference and needs the exact values, it invokes the built-in retrieveTool (e.g. retrieve_uncompressed_context({ handle: "ccr-89f41" })) to fetch the uncompressed original.

Where to go next โ€‹

  • Session โ€” conversation persistence; use compression to keep sessions lean.
  • Memory โ€” retain selected facts rather than summarising everything.
  • Context providers โ€” inject context deliberately instead of accumulating it.

Released under the MIT License.