MODULE 6.6

🎙️ Voice System

6 topics · ~60 minutes · Level: Deep · Type: Source
1

🎤 System Overview

A push-to-talk voice input pipeline that converts speech to prompt text. It is double-gated: the GrowthBook feature flag (VOICE_MODE) must be on AND an Anthropic OAuth token must be present.
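The double gate can be pictured as a single boolean check. This is a minimal sketch, not the code in voice/voiceModeEnabled.ts: the `VoiceGateInputs` shape and its field names are assumptions made for illustration.

```typescript
// ASSUMPTION: hypothetical input shape. The real module reads the flag from
// GrowthBook and the token from the auth layer.
interface VoiceGateInputs {
  voiceModeFlag: boolean;          // value of the VOICE_MODE feature flag
  oauthAccessToken: string | null; // Anthropic OAuth token, if signed in
}

// Voice mode is available only when BOTH gates are open.
function isVoiceModeEnabled(inputs: VoiceGateInputs): boolean {
  const hasToken =
    typeof inputs.oauthAccessToken === "string" &&
    inputs.oauthAccessToken.length > 0;
  return inputs.voiceModeFlag && hasToken;
}
```

Keeping the flag as one of the two conditions means it can act as a kill switch: turning it off disables voice even for signed-in users.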

voice/voiceModeEnabled.ts

Auth + kill-switch checks.

services/voice.ts

Audio recording backends (NAPI/arecord/SoX).

services/voiceStreamSTT.ts

WebSocket STT client to voice_stream endpoint.

hooks/useVoice.ts

React hook wiring audio → WS → transcript.

hooks/useVoiceIntegration.tsx

Prompt-input integration, hold-threshold, interim rendering.

2

🎵 Audio Recording Backends

| Backend | Platforms | Notes |
|---|---|---|
| audio-capture-napi | macOS, Linux, Windows | In-process via cpal + CoreAudio. Primary backend. |
| arecord | Linux only | ALSA utility. 150ms runtime probe detects non-functional installs. |
| SoX (rec) | macOS, Linux | External process piping raw PCM. --buffer 1024 prevents delay. |
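One way to choose among the three backends above is a first-match fallback with a cached result. The sketch below assumes the order NAPI → arecord → SoX; the `Probes` table and `createBackendPicker` are hypothetical names, not functions from services/voice.ts.

```typescript
type Backend = "audio-capture-napi" | "arecord" | "sox";

// ASSUMPTION: each probe reports whether its backend is usable on this machine.
type Probes = Record<Backend, () => boolean>;

const FALLBACK_ORDER: readonly Backend[] = ["audio-capture-napi", "arecord", "sox"];

// Returns a picker that probes at most once and then reuses the answer,
// so every caller in the session sees the same backend.
function createBackendPicker(probes: Probes): () => Backend | null {
  let cached: Backend | null | undefined; // undefined = not probed yet
  return () => {
    if (cached === undefined) {
      cached = FALLBACK_ORDER.find((name) => probes[name]()) ?? null;
    }
    return cached;
  };
}
```

Because `find` stops at the first usable backend, later probes never run, and the memoized value keeps the choice stable even if a probe would answer differently on a second attempt.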

💡 Why Runtime Probe

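On WSL1 and headless Linux, arecord can be installed yet unable to open an audio device: hasCommand('arecord') returns true, but open() fails. The probe therefore races the process against a 150ms timer: if arecord is still alive when the timer fires, the backend is treated as working.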
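The race can be sketched independently of any process API. Here `exited` is an assumed promise that settles when the probed process exits or fails to start; the function name and signature are illustrative only.

```typescript
// Resolves true if the process outlives the probe window, false if it
// exits (or errors) before the timer fires.
function probeStaysAlive(
  exited: Promise<unknown>,
  windowMs: number = 150,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const survived = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(true), windowMs);
  });
  // Any early settlement, success or failure, means the process is gone.
  const died = exited.then(() => false, () => false);
  return Promise.race([survived, died]).then((alive) => {
    clearTimeout(timer); // do not keep the event loop alive after the race
    return alive;
  });
}
```

With Node's child_process, `exited` could plausibly be built from the child's `exit` and `error` events, and a surviving child would be killed once the probe returns true.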

3

🌐 WebSocket STT Protocol

The client connects to wss://api.anthropic.com/api/ws/speech_to_text/voice_stream and authenticates with an OAuth Bearer token. It targets api.anthropic.com rather than claude.ai because Cloudflare TLS fingerprinting blocks non-browser clients.

KeepAlive

JSON control sent on open then every 8s.

Binary frames

Raw PCM audio chunks. Buffer.from() prevents NAPI shared-ArrayBuffer races.

CloseStream

JSON control signaling end of audio.

TranscriptText

Interim transcript chunks, revised by subsequent messages.
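The four message kinds above can be modeled with a few small helpers. The wire format is not spelled out in this module, so the JSON shapes below (a `type` field on control frames, a `data` field on transcripts) are assumptions, as are all function names. `copyChunk` uses a plain typed-array copy as a portable stand-in for the Buffer.from() copy mentioned above.

```typescript
type ControlType = "KeepAlive" | "CloseStream";

const KEEPALIVE_INTERVAL_MS = 8000; // from the text: on open, then every 8s

// ASSUMPTION: control frames are JSON objects carrying only a `type` field.
function encodeControl(type: ControlType): string {
  return JSON.stringify({ type });
}

// ASSUMPTION: transcript messages look like { type: "TranscriptText", data }.
interface TranscriptText {
  type: "TranscriptText";
  data: string;
}

// Returns the transcript message, or null for anything else (bad JSON,
// other message types, missing fields).
function parseTranscript(raw: string): TranscriptText | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof msg !== "object" || msg === null) return null;
  const rec = msg as Record<string, unknown>;
  if (rec.type === "TranscriptText" && typeof rec.data === "string") {
    return { type: "TranscriptText", data: rec.data };
  }
  return null;
}

// Copy a PCM chunk before queuing it, so later writes into the recorder's
// backing memory cannot change audio that is still waiting to be sent.
function copyChunk(chunk: Uint8Array): Uint8Array {
  return new Uint8Array(chunk); // typed-array constructor copies the bytes
}
```

Since interim transcripts are revised by later messages, a consumer would typically replace the displayed interim text with each new `data` value rather than append to it.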

4

✋ Hold-to-Talk Mechanics

Terminals deliver no keyup event, so the system reconstructs a hold from the timing gaps between auto-repeat key events (30-80ms apart). For bare-character bindings (e.g., Space), the hold threshold is 5 rapid presses; modifier combos activate on the first press.
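A hold detector along these lines counts consecutive key events whose spacing looks like auto-repeat. In this sketch the 120ms gap cutoff is an assumed value chosen to sit above the 30-80ms range, and the `HoldDetector` class is hypothetical.

```typescript
// ASSUMPTION: cutoff for treating two events as part of one auto-repeat run.
const MAX_REPEAT_GAP_MS = 120;
// From the text: 5 rapid presses for bare-character bindings.
const HOLD_THRESHOLD = 5;

class HoldDetector {
  private readonly hasModifier: boolean;
  private rapidCount = 0;
  private lastEventMs: number | null = null;

  constructor(hasModifier: boolean) {
    this.hasModifier = hasModifier;
  }

  // Feed one key event with its timestamp; returns true once the key
  // should be treated as held.
  onKey(nowMs: number): boolean {
    if (this.hasModifier) return true; // modifier combos activate at once
    const isRapid =
      this.lastEventMs !== null &&
      nowMs - this.lastEventMs <= MAX_REPEAT_GAP_MS;
    // A slow gap means ordinary typing, so the run starts over at 1.
    this.rapidCount = isRapid ? this.rapidCount + 1 : 1;
    this.lastEventMs = nowMs;
    return this.rapidCount >= HOLD_THRESHOLD;
  }
}
```

For example, events at 0, 50, 100, 150, and 200ms activate on the fifth event, while presses 500ms apart never do.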

💡 State Synchronization

updateState('recording') is called synchronously, BEFORE any await. If an await ran first, the key-handling guard would still see the stale 'idle' state and let spaces leak into the prompt input.
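The ordering matters because an async function runs synchronously only up to its first await. The sketch below illustrates that with a hypothetical `VoiceController`; `openMic` stands in for whatever async setup the real hook performs.

```typescript
type VoiceState = "idle" | "recording" | "processing";

class VoiceController {
  state: VoiceState = "idle";

  private updateState(next: VoiceState): void {
    this.state = next;
  }

  // `openMic` is a hypothetical stand-in for the async recording setup.
  async start(openMic: () => Promise<void>): Promise<void> {
    if (this.state !== "idle") return; // already started: ignore repeats
    this.updateState("recording"); // synchronous, before the first await
    try {
      await openMic();
    } catch (err) {
      this.updateState("idle"); // setup failed: release the guard
      throw err;
    }
  }

  // Guard consulted by the key handler: while not idle, the hold key is
  // swallowed instead of being typed into the prompt.
  shouldSwallowHoldKey(): boolean {
    return this.state !== "idle";
  }
}
```

By the time `start()` returns its pending promise to the caller, the state is already 'recording', so the next auto-repeat key event is swallowed rather than inserted.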

5

🔄 Silent-Drop Replay

About 1% of sessions hit a server bug: a sticky CE pod accepts audio but returns zero transcripts. The client detects the pattern (no_data_timeout + hadAudioSignal + wsConnected) and replays the full audio buffer on a fresh WebSocket. Replay is limited to once per session.
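The detection pattern reduces to a predicate over a few facts known at finalize time. This is a sketch under assumptions: the `FinalizeContext` shape, the `transcriptChars` field, and the function name are illustrative, while the three signals and the once-per-session limit come from the description above.

```typescript
type FinalizeSource =
  | "post_closestream_endpoint"
  | "no_data_timeout"
  | "ws_close"
  | "safety_timeout";

// ASSUMPTION: hypothetical snapshot of the session at finalize time.
interface FinalizeContext {
  source: FinalizeSource;   // which trigger ended the session
  hadAudioSignal: boolean;  // the microphone captured non-silent audio
  wsConnected: boolean;     // the socket was open, so audio was accepted
  transcriptChars: number;  // characters of transcript received so far
  alreadyReplayed: boolean; // replay is limited to once per session
}

// True when the session matches the silent-drop pattern and has not
// already used its single replay.
function shouldReplaySilentDrop(ctx: FinalizeContext): boolean {
  return (
    ctx.source === "no_data_timeout" &&
    ctx.hadAudioSignal &&
    ctx.wsConnected &&
    ctx.transcriptChars === 0 &&
    !ctx.alreadyReplayed
  );
}
```

Requiring `hadAudioSignal` avoids replaying sessions where the user simply said nothing, and the `alreadyReplayed` flag prevents a retry loop if the fresh connection also returns nothing.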

💡 Four Finalize Triggers

post_closestream_endpoint (~300ms), no_data_timeout (1.5s - silent-drop detector), ws_close (3-5s), safety_timeout (5s).
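With four triggers able to fire for the same session, a first-one-wins guard keeps finalization from running twice. The sketch below is one way to express that; `createFinalizer` is a hypothetical helper, and only the two exact timeouts from the list above are encoded as constants.

```typescript
type FinalizeSource =
  | "post_closestream_endpoint"
  | "no_data_timeout"
  | "ws_close"
  | "safety_timeout";

// Exact values from the text; the other two triggers are server-driven
// and given only approximately, so they are not encoded here.
const FINALIZE_TIMEOUTS_MS = {
  no_data_timeout: 1500,
  safety_timeout: 5000,
} as const;

// Wraps a callback so that only the first trigger to fire finalizes the
// session; later triggers are ignored and return false.
function createFinalizer(
  onFinalize: (source: FinalizeSource) => void,
): (source: FinalizeSource) => boolean {
  let done = false;
  return (source) => {
    if (done) return false;
    done = true;
    onFinalize(source);
    return true;
  };
}
```

Recording which source won is useful beyond bookkeeping, since a `no_data_timeout` win is the signal the silent-drop check looks for.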

6

🌍 Language & Keyterms

normalizeLanguageForSTT() maps the user's language setting to a BCP-47 code. Unsupported languages fall back to 'en' with a warning. Up to 50 keyterms are sent for STT boosting, drawn from hardcoded terms, the project root, and words in the git branch name.
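Both steps can be sketched as pure functions. Everything specific here is an assumption for illustration: the supported-language set and name table are invented subsets, `normalizeLanguageSketch` is a stand-in rather than the real normalizeLanguageForSTT(), and using the last path segment of the project root as a keyterm is a guess at what "project root" contributes.

```typescript
// ASSUMPTION: illustrative subset; the real supported list is not given.
const SUPPORTED_LANGUAGES = new Set(["en", "es", "fr", "de", "ja", "pt"]);
const LANGUAGE_NAMES: Record<string, string> = {
  english: "en", spanish: "es", french: "fr",
  german: "de", japanese: "ja", portuguese: "pt",
};

// Maps a free-form setting ("English", "pt-BR") to a base language code,
// falling back to "en" with a warning when the language is unsupported.
function normalizeLanguageSketch(
  setting: string | undefined,
  warn: (message: string) => void = console.warn,
): string {
  const raw = (setting ?? "").trim().toLowerCase();
  if (raw === "") return "en"; // ASSUMPTION: unset means English, no warning
  const base = LANGUAGE_NAMES[raw] ?? raw.split(/[-_]/)[0] ?? raw;
  if (SUPPORTED_LANGUAGES.has(base)) return base;
  warn(`Unsupported STT language "${setting}", falling back to "en"`);
  return "en";
}

const MAX_KEYTERMS = 50; // from the text

// Merges the three keyterm sources, de-duplicates, and caps the list.
function collectKeyterms(
  hardcoded: string[],
  projectRoot: string,
  gitBranch: string,
): string[] {
  // ASSUMPTION: the project root contributes its final directory name.
  const rootName = projectRoot.split(/[\\/]/).filter(Boolean).pop();
  // Branch names like "feature/voice-replay" split into separate words.
  const branchWords = gitBranch
    .split(/[^A-Za-z0-9]+/)
    .filter((word) => word.length > 2);
  const all = [...hardcoded, ...(rootName ? [rootName] : []), ...branchWords];
  return Array.from(new Set(all)).slice(0, MAX_KEYTERMS);
}
```

Placing the hardcoded terms first means they survive the 50-term cap when a long branch name or many extra terms would otherwise crowd them out.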

📋 Module Summary

Voice is double-gated: GrowthBook VOICE_MODE flag AND Anthropic OAuth token required.
Backend fallback chain: NAPI → arecord → SoX. Memoized probe result ensures consistency.
Audio starts before WebSocket opens - PCM chunks buffer until onReady fires.
State transitions to 'recording' synchronously before any await to prevent leaked characters.
Hold threshold (5 rapid presses) prevents accidental activation for bare-character bindings.
Silent-drop replay is client-side workaround for server bug. Full audio buffer kept for one-shot replay.