
Voice

The voice module provides text-to-speech (TTS) and speech-to-text (STT) adapters, plus a real-time streaming session that wires together speech input → agent reasoning → spoken output.

ts
import {
  OpenAIVoiceProvider,
  ElevenLabsVoiceProvider,
  createVoiceProvider,
  VoiceStreamSession,
} from 'confused-ai';

Text-to-Speech (TTS)

ts
import { OpenAIVoiceProvider } from 'confused-ai';

const voice = new OpenAIVoiceProvider({ apiKey: process.env.OPENAI_API_KEY });

const result = await voice.textToSpeech('Hello, how can I help you today?', {
  voiceId: 'nova',   // 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer'
  model: 'tts-1-hd',
  speed: 1.0,        // 0.25 to 4.0
});

// result.audio            : ArrayBuffer
// result.format           : 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm'
// result.durationSeconds
// result.characterCount
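
Since `result.audio` is an `ArrayBuffer`, a common next step is writing it to disk with an extension that matches `result.format`. The following is a minimal sketch for Node.js; `SynthesisedAudio`, `audioFileName`, and `saveAudio` are hypothetical names introduced here, not part of `confused-ai`.

```typescript
import { writeFile } from 'node:fs/promises';

// Formats listed in the result comment above.
type AudioFormat = 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm';

// Hypothetical stand-in for the two TTS result fields used here.
interface SynthesisedAudio {
  audio: ArrayBuffer;
  format: AudioFormat;
}

// Build a file name whose extension matches the reported format.
function audioFileName(baseName: string, format: AudioFormat): string {
  return `${baseName}.${format}`;
}

// Write the audio bytes to disk and return the path that was written.
async function saveAudio(result: SynthesisedAudio, baseName: string): Promise<string> {
  const path = audioFileName(baseName, result.format);
  // Uint8Array is a view over the ArrayBuffer, so no bytes are copied here.
  await writeFile(path, new Uint8Array(result.audio));
  return path;
}
```

With the `result` from the example above, `await saveAudio(result, 'greeting')` would write `greeting.mp3` if the provider reports the `mp3` format.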

Speech-to-Text (STT)

ts
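// readAudioFile is an application-level helper (not imported from confused-ai)
// that returns the file's bytes as an ArrayBuffer.
const audioBuffer: ArrayBuffer = await readAudioFile('./question.mp3');

// `voice` is the provider instance created in the TTS example.
const transcript = await voice.speechToText(audioBuffer, { language: 'en' });
// transcript.text        : transcribed string
// transcript.language    : detected language
// transcript.confidence  : 0 to 1
// transcript.durationSeconds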
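
The `readAudioFile` call above is not imported from `confused-ai`, so it is presumably a helper supplied by the application. One possible Node.js implementation is sketched below; `toArrayBuffer` is a hypothetical name, and the copy step is there because a Node.js `Buffer` can be a view into a larger shared memory pool.

```typescript
import { readFile } from 'node:fs/promises';

// Copy exactly the bytes covered by a view into a fresh, standalone ArrayBuffer.
function toArrayBuffer(view: Uint8Array): ArrayBuffer {
  const out = new ArrayBuffer(view.byteLength);
  new Uint8Array(out).set(view);
  return out;
}

// Assumed shape of the helper used in the STT example: path in, ArrayBuffer out.
async function readAudioFile(path: string): Promise<ArrayBuffer> {
  const buf = await readFile(path); // Node.js Buffer, a Uint8Array subclass
  return toArrayBuffer(buf);
}
```

Returning `buf.buffer` directly would be shorter, but it can expose unrelated bytes from the underlying pool, so the explicit copy is the safer default.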

ElevenLabs provider

ts
import { ElevenLabsVoiceProvider } from 'confused-ai';

const voice = new ElevenLabsVoiceProvider({
  apiKey: process.env.ELEVENLABS_API_KEY!,
});

const result = await voice.textToSpeech('Welcome back!', {
  voiceId: 'rachel',  // ElevenLabs voice ID
});

createVoiceProvider factory

ts
import { createVoiceProvider } from 'confused-ai';

const voice = createVoiceProvider({
  provider: process.env.VOICE_PROVIDER as 'openai' | 'elevenlabs',
  apiKey: process.env.VOICE_API_KEY!,
  voiceId: 'nova',
  model: 'tts-1',
});
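
The `as 'openai' | 'elevenlabs'` cast above silences the compiler but does not check the environment variable at runtime. A small validation step, sketched here with the hypothetical helper `parseVoiceProviderName`, turns a misspelled or missing `VOICE_PROVIDER` into an explicit startup error.

```typescript
// The two provider names accepted in the factory example above.
const VOICE_PROVIDERS = ['openai', 'elevenlabs'] as const;
type VoiceProviderName = (typeof VOICE_PROVIDERS)[number];

// Return the value as a narrowed provider name, or throw if it is not supported.
function parseVoiceProviderName(value: string | undefined): VoiceProviderName {
  const match = VOICE_PROVIDERS.find((name) => name === value);
  if (match === undefined) {
    throw new Error(
      `VOICE_PROVIDER must be one of: ${VOICE_PROVIDERS.join(', ')} (got ${String(value)})`,
    );
  }
  return match;
}
```

The factory call could then use `provider: parseVoiceProviderName(process.env.VOICE_PROVIDER)` in place of the cast.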

Real-time voice streaming

VoiceStreamSession connects a microphone stream to an agent and streams synthesised audio back:

ts
import { OpenAIVoiceProvider, VoiceStreamSession, createAgent } from 'confused-ai';

const voiceProvider = new OpenAIVoiceProvider();
const agent = createAgent({
  name: 'voice-assistant',
  instructions: 'You are a helpful voice assistant. Be concise.',
  model: 'gpt-4o-mini',
  apiKey: process.env.OPENAI_API_KEY!,
});

const session = new VoiceStreamSession({
  stt: voiceProvider,
  tts: voiceProvider,
  run: async (text) => {
    const result = await agent.run(text);
    return result.text;
  },
  voiceId: 'nova',
  silenceThresholdMs: 800,   // ms of silence to trigger STT
});

// Push raw PCM audio chunks in real time (from microphone, WebSocket, etc.).
// `microphone` and `speaker` are placeholders for your application's audio I/O streams.
microphone.on('data', (chunk) => session.pushChunk(chunk));

// Consume events
for await (const event of session.events()) {
  switch (event.type) {
    case 'transcript':
      console.log('User said:', event.text);
      break;
    case 'text_delta':
      process.stdout.write(event.delta ?? '');
      break;
    case 'audio':
      speaker.write(event.chunk);  // play synthesised audio
      break;
    case 'agent_end':
      console.log('Agent finished speaking');
      break;
    case 'error':
      console.error('Voice error:', event.error);
      break;
  }
}

await session.end();
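
The `silenceThresholdMs` option suggests that the session decides when an utterance has ended by measuring trailing silence. How `VoiceStreamSession` does this internally is not documented here; the sketch below only illustrates the general idea, assuming 16-bit mono PCM samples and a known sample rate. `SilenceDetector`, `rms`, and the `0.01` loudness cutoff are hypothetical.

```typescript
// Root-mean-square amplitude of 16-bit PCM samples, normalised to the range 0..1.
function rms(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) {
    const x = s / 32768;
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

class SilenceDetector {
  private silentMs = 0;
  private heardSpeech = false;
  private readonly sampleRate: number;
  private readonly silenceThresholdMs: number;
  private readonly rmsThreshold: number;

  constructor(sampleRate: number, silenceThresholdMs: number, rmsThreshold = 0.01) {
    this.sampleRate = sampleRate;
    this.silenceThresholdMs = silenceThresholdMs;
    this.rmsThreshold = rmsThreshold; // assumed cutoff between speech and silence
  }

  // Feed one chunk; returns true once enough silence has followed some speech.
  push(samples: Int16Array): boolean {
    const chunkMs = (samples.length / this.sampleRate) * 1000;
    if (rms(samples) >= this.rmsThreshold) {
      this.heardSpeech = true;
      this.silentMs = 0;
      return false;
    }
    if (!this.heardSpeech) return false; // ignore silence before any speech
    this.silentMs += chunkMs;
    if (this.silentMs < this.silenceThresholdMs) return false;
    // Utterance finished: reset so the next utterance starts clean.
    this.heardSpeech = false;
    this.silentMs = 0;
    return true;
  }
}
```

A lower threshold makes the assistant respond sooner but risks cutting speakers off mid-pause; a higher one trades latency for fewer false endpoints.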

VoiceStreamEvent types

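| type        | Fields             | Description                   |
| ----------- | ------------------ | ----------------------------- |
| transcript  | text               | STT result from user speech   |
| agent_start | (none)             | Agent processing began        |
| text_delta  | delta              | One token from agent response |
| audio       | chunk (Uint8Array) | TTS audio chunk (MP3)         |
| agent_end   | (none)             | Agent finished responding     |
| error       | error              | Session-level error           |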
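
To show how these events might be folded into per-turn state, the sketch below models them as a discriminated union and reduces them into a summary. The union is a local stand-in inferred from the event list above (the library's own type may use optional fields instead), and `TurnSummary` and `applyEvent` are hypothetical names.

```typescript
// Local stand-in for the event shape, modelled on the event list above.
type VoiceStreamEvent =
  | { type: 'transcript'; text: string }
  | { type: 'agent_start' }
  | { type: 'text_delta'; delta: string }
  | { type: 'audio'; chunk: Uint8Array }
  | { type: 'agent_end' }
  | { type: 'error'; error: unknown };

interface TurnSummary {
  transcript: string;
  reply: string;
  audioBytes: number;
  done: boolean;
}

const emptyTurn: TurnSummary = { transcript: '', reply: '', audioBytes: 0, done: false };

// Pure reducer: returns the turn state after applying one event.
function applyEvent(turn: TurnSummary, event: VoiceStreamEvent): TurnSummary {
  switch (event.type) {
    case 'transcript':
      // Assumption: a new transcript marks the start of a new turn.
      return { ...emptyTurn, transcript: event.text };
    case 'agent_start':
      return turn;
    case 'text_delta':
      return { ...turn, reply: turn.reply + event.delta };
    case 'audio':
      return { ...turn, audioBytes: turn.audioBytes + event.chunk.byteLength };
    case 'agent_end':
      return { ...turn, done: true };
    case 'error':
      throw event.error instanceof Error ? event.error : new Error(String(event.error));
  }
}
```

Inside a `for await` loop over the session's events, `turn = applyEvent(turn, event)` would keep a running summary that can be logged or persisted once `turn.done` is true.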

VoiceProvider interface

Implement this to add any TTS/STT backend:

ts
interface VoiceProvider {
  textToSpeech(text: string, options?: Partial<VoiceConfig>): Promise<TTSResult>;
  speechToText?(audio: ArrayBuffer | Blob, options?: { language?: string }): Promise<STTResult>;
  listVoices?(): Promise<Array<{ id: string; name: string; preview_url?: string }>>;
}
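
As an illustration of implementing the interface, the sketch below defines an in-memory fake that could stand in for a real backend in unit tests. `VoiceConfig`, `TTSResult`, and `STTResult` are declared locally as assumed shapes, inferred from the option and result fields shown earlier on this page; in real code they would come from `confused-ai`. `FakeVoiceProvider` is a hypothetical name.

```typescript
// Assumed shapes, inferred from the examples on this page.
interface VoiceConfig { voiceId: string; model: string; speed: number }
interface TTSResult {
  audio: ArrayBuffer;
  format: 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm';
  durationSeconds?: number;
  characterCount?: number;
}
interface STTResult { text: string; language?: string; confidence?: number; durationSeconds?: number }

// Copied from the interface definition above.
interface VoiceProvider {
  textToSpeech(text: string, options?: Partial<VoiceConfig>): Promise<TTSResult>;
  speechToText?(audio: ArrayBuffer | Blob, options?: { language?: string }): Promise<STTResult>;
  listVoices?(): Promise<Array<{ id: string; name: string; preview_url?: string }>>;
}

// Test double: records what was "spoken" and returns a canned transcript.
class FakeVoiceProvider implements VoiceProvider {
  readonly spoken: string[] = [];
  private readonly cannedTranscript: string;

  constructor(cannedTranscript: string) {
    this.cannedTranscript = cannedTranscript;
  }

  async textToSpeech(text: string): Promise<TTSResult> {
    this.spoken.push(text);
    // Empty audio: enough for tests that only check the call flow.
    return { audio: new ArrayBuffer(0), format: 'pcm', durationSeconds: 0, characterCount: text.length };
  }

  async speechToText(): Promise<STTResult> {
    return { text: this.cannedTranscript, language: 'en', confidence: 1 };
  }
}
```

Because only `textToSpeech` is required, a TTS-only backend can omit `speechToText` and `listVoices` entirely.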

Where to go next

  • Vision: image and multimodal inputs.
  • WebSocket: realtime transport for voice sessions.
  • Hooks: lifecycle hooks for monitoring speech latency.

Released under the MIT License.