Voice โ
The voice module provides TTS/STT adapters and a real-time streaming session that wires together speech input โ agent reasoning โ spoken output.
ts
import {
OpenAIVoiceProvider,
ElevenLabsVoiceProvider,
createVoiceProvider,
VoiceStreamSession,
} from 'confused-ai';Text-to-Speech (TTS) โ
ts
import { OpenAIVoiceProvider } from 'confused-ai';
const voice = new OpenAIVoiceProvider({ apiKey: process.env.OPENAI_API_KEY });
const result = await voice.textToSpeech('Hello, how can I help you today?', {
voiceId: 'nova', // 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer'
model: 'tts-1-hd',
speed: 1.0, // 0.25โ4.0
});
// result.audio โ ArrayBuffer
// result.format โ 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm'
// result.durationSeconds
// result.characterCountSpeech-to-Text (STT) โ
ts
const audioBuffer: ArrayBuffer = await readAudioFile('./question.mp3');
const transcript = await voice.speechToText(audioBuffer, { language: 'en' });
// transcript.text โ transcribed string
// transcript.language โ detected language
// transcript.confidence โ 0โ1
// transcript.durationSecondsElevenLabs provider โ
ts
import { ElevenLabsVoiceProvider } from 'confused-ai';
const voice = new ElevenLabsVoiceProvider({
apiKey: process.env.ELEVENLABS_API_KEY!,
});
const result = await voice.textToSpeech('Welcome back!', {
voiceId: 'rachel', // ElevenLabs voice ID
});createVoiceProvider factory โ
ts
import { createVoiceProvider } from 'confused-ai';
const voice = createVoiceProvider({
provider: process.env.VOICE_PROVIDER as 'openai' | 'elevenlabs',
apiKey: process.env.VOICE_API_KEY!,
voiceId: 'nova',
model: 'tts-1',
});Real-time voice streaming โ
VoiceStreamSession connects a microphone stream to an agent and streams synthesised audio back:
ts
import { OpenAIVoiceProvider, VoiceStreamSession } from 'confused-ai';
import { createAgent } from 'confused-ai';
const voiceProvider = new OpenAIVoiceProvider();
const agent = createAgent({
name: 'voice-assistant',
instructions: 'You are a helpful voice assistant. Be concise.',
model: 'gpt-4o-mini',
apiKey: process.env.OPENAI_API_KEY!,
});
const session = new VoiceStreamSession({
stt: voiceProvider,
tts: voiceProvider,
run: async (text) => {
const result = await agent.run(text);
return result.text;
},
voiceId: 'nova',
silenceThresholdMs: 800, // ms of silence to trigger STT
});
// Push raw PCM audio chunks in real time (from microphone, WebSocket, etc.)
microphone.on('data', (chunk) => session.pushChunk(chunk));
// Consume events
for await (const event of session.events()) {
switch (event.type) {
case 'transcript':
console.log('User said:', event.text);
break;
case 'text_delta':
process.stdout.write(event.delta ?? '');
break;
case 'audio':
speaker.write(event.chunk); // play synthesised audio
break;
case 'agent_end':
console.log('Agent finished speaking');
break;
case 'error':
console.error('Voice error:', event.error);
break;
}
}
await session.end();VoiceStreamEvent types โ
type | Fields | Description |
|---|---|---|
transcript | text | STT result from user speech |
agent_start | โ | Agent processing began |
text_delta | delta | One token from agent response |
audio | chunk (Uint8Array) | TTS audio chunk (MP3) |
agent_end | โ | Agent finished responding |
error | error | Session-level error |
VoiceProvider interface โ
Implement this to add any TTS/STT backend:
ts
interface VoiceProvider {
textToSpeech(text: string, options?: Partial<VoiceConfig>): Promise<TTSResult>;
speechToText?(audio: ArrayBuffer | Blob, options?: { language?: string }): Promise<STTResult>;
listVoices?(): Promise<Array<{ id: string; name: string; preview_url?: string }>>;
}