Vision

Vision lets you pass images, PDFs, and other media to vision-capable models. Use the multiModal() helper to combine text prompts with one or more image sources.

ts
import {
  multiModal,
  imageUrl,
  imageFile,
  imageBuffer,
} from 'confused-ai';

Pass a remote image

ts
import { createAgent, multiModal, imageUrl } from 'confused-ai';

const agent = createAgent({
  name: 'vision-agent',
  instructions: 'Analyse the images provided by the user.',
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY!,
});

const result = await agent.run(
  multiModal(
    'What is in this image?',
    imageUrl('https://example.com/photo.jpg'),
  ),
);

Pass a local file

ts
import { multiModal, imageFile } from 'confused-ai';

// imageFile() is async: it loads and base64-encodes the file.
// `agent` is the instance created in the previous example.
const result = await agent.run(
  multiModal(
    'Describe this chart.',
    await imageFile('./chart.png'),
  ),
);

Pass a buffer (canvas, upload, fetch response)

ts
import { multiModal, imageBuffer } from 'confused-ai';

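const response = await fetch('https://example.com/diagram.png');
// Fail early on HTTP errors instead of sending an error page as image bytes.
if (!response.ok) {
  throw new Error(`Failed to fetch image: HTTP ${response.status}`);
}
const buffer = await response.arrayBuffer();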

const result = await agent.run(
  multiModal(
    'What does this architecture diagram show?',
    imageBuffer(buffer, 'image/png'),
  ),
);
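
The MIME type passed to imageBuffer() is hard-coded in the example above. When the bytes come from a fetch response, one option is to derive it from the Content-Type header instead. The following is a minimal sketch: mimeTypeFromHeader is a hypothetical helper, not part of confused-ai, and falling back for non-image types is an assumption made for illustration.

```typescript
// Hypothetical helper (not part of confused-ai): extract the bare MIME type
// from a Content-Type header value such as "image/png; charset=binary".
function mimeTypeFromHeader(header: string | null, fallback: string): string {
  if (!header) return fallback;
  // Parameters follow the first ';' and are not part of the MIME type.
  const [first = ''] = header.split(';');
  const mimeType = first.trim().toLowerCase();
  // Accept only image types; anything else uses the caller's fallback.
  return mimeType.startsWith('image/') ? mimeType : fallback;
}
```

It could then replace the literal in the call above, for example imageBuffer(buffer, mimeTypeFromHeader(response.headers.get('content-type'), 'image/png')).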

Multiple images in one message

ts
// beforeUrl and afterUrl are strings holding the URLs of the two screenshots.
const result = await agent.run(
  multiModal(
    'Compare these two screenshots and explain the differences.',
    imageUrl(beforeUrl),
    imageUrl(afterUrl),
  ),
);

Image detail level

Control quality vs. speed with the detail option:

ts
imageUrl('https://example.com/photo.jpg', { detail: 'high' })
imageUrl('https://example.com/thumbnail.jpg', { detail: 'low' })
// 'auto' (default): the model decides
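
One way to choose a detail level programmatically is from the image's pixel dimensions. The sketch below is a heuristic, not library behaviour: pickDetail and the 512-pixel threshold are assumptions chosen for illustration, and the Detail type simply mirrors the three values shown above.

```typescript
// The three detail values shown in this section.
type Detail = 'low' | 'high' | 'auto';

// Hypothetical helper (not part of confused-ai). The 512-pixel default
// threshold is an illustrative assumption; tune it for your model and workload.
function pickDetail(width?: number, height?: number, threshold = 512): Detail {
  // Unknown dimensions: leave the decision to the model.
  if (width === undefined || height === undefined) return 'auto';
  // Images whose longest side fits the threshold are assumed to gain little
  // from high detail, so the faster option is preferred for them.
  return Math.max(width, height) <= threshold ? 'low' : 'high';
}
```

The result can be passed straight through, for example imageUrl(url, { detail: pickDetail(1920, 1080) }).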

ImageSource types

Type        | Factory                            | Description
ImageUrl    | imageUrl(url, opts?)               | HTTPS or data URI
ImageFile   | await imageFile(path, opts?)       | Local file, loaded at call time (Node.js only)
ImageBuffer | imageBuffer(data, mimeType, opts?) | Raw ArrayBuffer / Uint8Array
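
Because imageUrl() accepts data URIs, raw bytes can also be sent by encoding them first. This is a minimal sketch, assuming a runtime with a global btoa (browsers and recent Node.js versions); toDataUri is a hypothetical helper, not part of confused-ai.

```typescript
// Hypothetical helper (not part of confused-ai): encode raw bytes as a
// base64 data URI of the form "data:<mimeType>;base64,<payload>".
function toDataUri(data: ArrayBuffer | Uint8Array, mimeType: string): string {
  const bytes = data instanceof Uint8Array ? data : new Uint8Array(data);
  // btoa() expects a string with one character per byte, so build it
  // byte by byte rather than spreading a large array into one call.
  let binary = '';
  bytes.forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  return `data:${mimeType};base64,${btoa(binary)}`;
}
```

For typical use, imageBuffer() is the more direct route; a data URI is mainly useful when only a URL-shaped value can be passed along.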

Supported formats

Images: jpg, jpeg, png, gif, webp, bmp, svg, tiff, heic
Audio: mp3, wav, ogg, m4a, flac, webm
Video: mp4, webm, mov, avi, mkv


Where to go next

  • Voice: speech input and output.
  • Video: process and summarise video content.
  • Agents: the multiModal option for agent.run().

Released under the MIT License.