🎤 System Overview
Push-to-talk voice input pipeline converting speech to prompt text. Double-gated: GrowthBook feature flag (VOICE_MODE) AND Anthropic OAuth token required.
voice/voiceModeEnabled.ts
Auth + kill-switch checks.
services/voice.ts
Audio recording backends (NAPI/arecord/SoX).
services/voiceStreamSTT.ts
WebSocket STT client to voice_stream endpoint.
hooks/useVoice.ts
React hook wiring audio → WS → transcript.
hooks/useVoiceIntegration.tsx
Prompt-input integration, hold-threshold, interim rendering.
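The double gate from the overview can be sketched as a single predicate. This is a minimal illustration only: `VoiceGateInputs` and `isVoiceModeAvailable` are hypothetical names, and the real checks in voice/voiceModeEnabled.ts may be structured differently.

```typescript
// Hypothetical input shape; how the flag and token are actually looked up
// is not shown in this document.
interface VoiceGateInputs {
  featureFlagOn: boolean;         // result of the VOICE_MODE flag lookup
  oauthToken: string | undefined; // Anthropic OAuth token, if signed in
}

// Both gates must pass. Turning the flag off disables voice for everyone,
// regardless of auth state.
function isVoiceModeAvailable(inputs: VoiceGateInputs): boolean {
  const hasToken =
    typeof inputs.oauthToken === "string" && inputs.oauthToken.length > 0;
  return inputs.featureFlagOn && hasToken;
}
```

Keeping the check as a pure function of its inputs makes it easy to call from both the UI layer and the recording path.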
🎵 Audio Recording Backends
| Backend | Platforms | Notes |
|---|---|---|
| audio-capture-napi | macOS, Linux, Windows | In-process via cpal + CoreAudio. Primary backend. |
| arecord | Linux only | ALSA utility. 150ms runtime probe detects non-functional installs. |
| SoX (rec) | macOS, Linux | External process piping raw PCM. --buffer 1024 keeps SoX's internal buffer small so chunks arrive without delay. |
💡 Why Runtime Probe
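On WSL1 and headless Linux, arecord is installed but open() fails: hasCommand('arecord') returns true even though the device can't be opened. The probe is a 150ms race between the process exiting and a timer. If the process is still alive when the timer fires, the device opened and the backend is usable.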
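The race described above can be sketched without any process-specific code. `stillAliveAfter` is a hypothetical helper, not the project's function: it takes a promise that settles when the probed process exits or fails, and reports whether that promise is still pending after the timeout.

```typescript
// Resolves true if `exited` has NOT settled within `ms` milliseconds,
// i.e. the probed process is still running and presumably opened the device.
function stillAliveAfter(exited: Promise<unknown>, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(true), ms);
    const onExit = (): void => {
      clearTimeout(timer); // process died first: backend is not usable
      resolve(false);
    };
    // An early exit and a spawn error both count as "not alive".
    exited.then(onExit, onExit);
  });
}
```

In a Node setting, `exited` would typically wrap the child process's `exit` and `error` events, with `ms` set to 150. A process that fails to open the device exits almost immediately, so it loses the race.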
🌐 WebSocket STT Protocol
Connects to wss://api.anthropic.com/api/ws/speech_to_text/voice_stream, authenticating with an OAuth Bearer token. The client targets api.anthropic.com rather than claude.ai because Cloudflare TLS fingerprinting blocks non-browser clients.
KeepAlive
JSON control sent on open then every 8s.
Binary frames
Raw PCM audio chunks. Each chunk is copied with Buffer.from() before sending, which prevents races on the ArrayBuffer shared with the NAPI module.
CloseStream
JSON control signaling end of audio.
TranscriptText
Interim transcript chunks, revised by subsequent messages.
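A minimal sketch of the four message kinds listed above. The message names come from this document, but the JSON field names (`type`, `text`) are assumptions made for illustration, and the wire format may differ. `copyChunk` uses a plain typed-array copy as a stand-in for the Buffer.from() copy.

```typescript
// Field names are assumed; only the message names are from the notes above.
type ClientControl = { type: "KeepAlive" } | { type: "CloseStream" };
type TranscriptText = { type: "TranscriptText"; text: string };

// KeepAlive is sent on open, then on this interval.
const KEEPALIVE_INTERVAL_MS = 8000;

function encodeControl(msg: ClientControl): string {
  return JSON.stringify(msg);
}

// Copy a PCM chunk before queuing it, so later writes into the native
// module's shared buffer cannot change bytes that are still waiting to be sent.
function copyChunk(chunk: Uint8Array): Uint8Array {
  return new Uint8Array(chunk); // the typed-array constructor copies the bytes
}

// Returns the transcript message, or null for anything else.
function parseTranscript(raw: string): TranscriptText | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON: ignore
  }
  if (typeof data !== "object" || data === null) return null;
  const rec = data as Record<string, unknown>;
  if (rec.type === "TranscriptText" && typeof rec.text === "string") {
    return { type: "TranscriptText", text: rec.text };
  }
  return null;
}
```

Because interim transcripts are revised by later messages, a caller would normally replace the displayed interim text with each new `TranscriptText` rather than append to it.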
✋ Hold-to-Talk Mechanics
Terminals deliver no keyup event, so the system reconstructs a hold by timing the gaps between auto-repeat events (30-80ms apart). Hold threshold: 5 rapid presses for bare-character bindings (e.g., Space). Modifier combos activate on the first press.
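The hold reconstruction can be sketched as a small counter over key-event timestamps. The threshold of 5 comes from the text. `MAX_REPEAT_GAP_MS` is an assumed cutoff chosen to sit above the 30-80ms auto-repeat range, and `HoldDetector` is a hypothetical name.

```typescript
const HOLD_THRESHOLD = 5;      // rapid presses needed for bare-character keys
const MAX_REPEAT_GAP_MS = 120; // assumption: anything slower is a fresh press

class HoldDetector {
  private count = 0;
  private lastAt: number | null = null;

  // Feed one key event; returns true once the key counts as held.
  press(atMs: number, hasModifier: boolean): boolean {
    if (hasModifier) return true; // modifier combos activate on first press
    const rapid =
      this.lastAt !== null && atMs - this.lastAt <= MAX_REPEAT_GAP_MS;
    this.count = rapid ? this.count + 1 : 1; // a slow gap restarts the run
    this.lastAt = atMs;
    return this.count >= HOLD_THRESHOLD;
  }
}
```

With events at 0, 50, 100, 150, and 200ms, the fifth call is the first to return true. A single typed Space never reaches the threshold, so ordinary typing is unaffected.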
💡 State Synchronization
updateState('recording') is called synchronously BEFORE any await. If async work ran first, the guard handling the next keypresses would still see the stale 'idle' state and let spaces leak into the prompt input.
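A small sketch of why the ordering matters, under simplified assumptions: `VoiceController`, `onKey`, and `typed` are hypothetical stand-ins for the real hook state and prompt input.

```typescript
type VoiceState = "idle" | "recording";

class VoiceController {
  state: VoiceState = "idle";
  typed: string[] = []; // stands in for characters reaching the prompt input

  // Guard consulted for every key event: only an idle session types the key.
  onKey(ch: string): void {
    if (this.state === "idle") this.typed.push(ch);
  }

  async startRecording(openDevice: () => Promise<void>): Promise<void> {
    this.state = "recording"; // flipped synchronously, before any await
    await openDevice();       // auto-repeat keys can arrive while this is pending
  }
}
```

An async function runs synchronously up to its first `await`, so the state is already 'recording' when `startRecording` returns its promise. Moving the assignment below `await openDevice()` would reopen the window in which repeated spaces are typed.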
🔄 Silent-Drop Replay
About 1% of sessions hit a server bug in which a sticky CE pod accepts audio but returns zero transcripts. The client detects the pattern (no_data_timeout + hadAudioSignal + wsConnected) and replays the full audio buffer on a fresh WebSocket. Replay is limited to once per session.
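The detection rule and the once-per-session limit can be sketched as a small policy object. The three conditions are the ones named above. `FinalizeInfo` and `ReplayPolicy` are hypothetical names, and the real client may track this state differently.

```typescript
// Hypothetical summary of how a recording attempt ended.
interface FinalizeInfo {
  trigger: string;         // e.g. "no_data_timeout"
  hadAudioSignal: boolean; // the microphone captured non-silent audio
  wsConnected: boolean;    // the WebSocket was open the whole time
}

class ReplayPolicy {
  private replayed = false;

  // True only for the first silent drop seen in this session.
  shouldReplay(info: FinalizeInfo): boolean {
    const silentDrop =
      info.trigger === "no_data_timeout" &&
      info.hadAudioSignal &&
      info.wsConnected;
    if (!silentDrop || this.replayed) return false;
    this.replayed = true; // replay is limited to once per session
    return true;
  }
}
```

Requiring `hadAudioSignal` keeps a genuinely silent recording from being mistaken for a server-side drop.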
💡 Four Finalize Triggers
post_closestream_endpoint (~300ms), no_data_timeout (1.5s - silent-drop detector), ws_close (3-5s), safety_timeout (5s).
🌍 Language & Keyterms
normalizeLanguageForSTT() maps the user's language setting to a BCP-47 code, falling back to 'en' with a warning for unsupported languages. Up to 50 keyterms are sent for STT boosting, drawn from hardcoded terms, the project root, and git branch words.
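A rough sketch of both steps, with several assumptions stated up front: `SUPPORTED` is a placeholder set (the real list of STT languages is not given here), `normalizeLanguageSketch` only handles locale-style tags such as "fr-FR", and `collectKeyterms` assumes the project root contributes its final path segment. None of this is the actual normalizeLanguageForSTT() implementation.

```typescript
// Placeholder: the real supported-language list is not specified.
const SUPPORTED = new Set<string>(["en", "es", "fr", "de", "ja"]);

function normalizeLanguageSketch(
  setting: string,
  warn: (msg: string) => void = (msg) => console.warn(msg),
): string {
  // Keep only the primary subtag: "fr-FR" and "fr_FR" both become "fr".
  const primary = setting.trim().toLowerCase().split(/[-_]/)[0] ?? "";
  if (SUPPORTED.has(primary)) return primary;
  warn(`Unsupported STT language "${setting}", falling back to "en"`);
  return "en";
}

const MAX_KEYTERMS = 50;

function collectKeyterms(
  hardcoded: string[],
  projectRoot: string,
  branch: string,
): string[] {
  // Assumption: the project root contributes its last path segment.
  const rootName = projectRoot.split(/[\\/]/).filter(Boolean).pop() ?? "";
  // Split the branch name on anything that is not a letter or digit.
  const branchWords = branch.split(/[^A-Za-z0-9]+/).filter(Boolean);
  const unique = new Set<string>(
    [...hardcoded, rootName, ...branchWords].filter(Boolean),
  );
  return Array.from(unique).slice(0, MAX_KEYTERMS); // cap at 50 terms
}
```

Deduplicating before applying the cap means repeated words do not use up slots that more distinctive terms could fill.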