🎤 Gemini Voice V2: Audio Architecture Evolution
This page records the audio setups tried during V2 development so that failed configurations are not repeated.
1. The ALSA/PyAudio Attempt (V1 Style)
- Setup: Used standard PyAudio streams with standard ALSA device names.
- Result: FAILED.
- Failure Mode: “Expression 'alsa_snd_pcm_mmap_begin' failed.”
- Lessons: PyAudio struggled with ALSA memory mapping on Kubuntu 25.10 when multiple system apps held the audio device.
2. The Sounddevice Split-Thread Setup
- Setup: Switched to `sounddevice` for robustness. The microphone and speaker ran on separate `InputStream` and `OutputStream` threads.
- Result: UNSTABLE.
- Failure Mode: “1011 keepalive ping timeout.”
- Lessons: The separate hardware threads occasionally blocked the `asyncio` event loop, preventing the WebSocket from answering server pings.
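The stall described in this setup can be reproduced with the standard library alone. The sketch below is not project code: `blocking_audio_read` is a hypothetical stand-in for a blocking hardware call, and `heartbeat` stands in for the WebSocket keepalive. It counts how often the event loop gets to run while the "hardware" call is in progress, first called directly on the loop and then offloaded with `asyncio.to_thread`.

```python
import asyncio
import time


def blocking_audio_read(seconds: float) -> bytes:
    """Hypothetical stand-in for a blocking hardware read."""
    time.sleep(seconds)  # blocks whichever thread calls it
    return b"\x00\x00" * 160


async def heartbeat(ticks: list, interval: float = 0.02) -> None:
    """Stand-in for the keepalive: records every time the loop lets it run."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def demo(offload: bool, block_for: float = 0.3) -> int:
    """Return how many heartbeat ticks happened around one blocking read."""
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # give the heartbeat its first turn
    if offload:
        # Runs in a worker thread; the event loop keeps servicing the heartbeat.
        await asyncio.to_thread(blocking_audio_read, block_for)
    else:
        # Runs on the event loop thread; nothing else can run until it returns.
        blocking_audio_read(block_for)
    beat.cancel()
    return len(ticks)


if __name__ == "__main__":
    print("ticks while blocking the loop:", asyncio.run(demo(offload=False)))
    print("ticks with asyncio.to_thread: ", asyncio.run(demo(offload=True)))
```

When the read runs directly on the loop, the heartbeat gets only its initial tick. A real keepalive starved this way for long enough would miss the server's ping deadline, which matches the 1011 timeout above.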
3. The Half-Duplex (Mic Muting) Setup
- Setup: Mic was muted while the assistant was speaking to prevent echo.
- Result: FAILED.
- Failure Mode: Assistant would answer once, then lock up.
- Lessons: The VAD (Voice Activity Detection) state was getting stuck, and the AI was confused by the hard cut in audio data.
4. The "Tank" Duplex Engine (Final Production)
- Setup: Uses a single `sd.Stream` (Duplex) which handles both Mic and Speaker in one hardware-managed thread.
- Result: ACTIVE / TESTING.
- Key Features:
  - 16 kHz Unified Rate: Maximizes Bluetooth (q20i) and network stability.
  - Thread-Safe Queues: Decouple the hardware callback from the WebSocket loop completely.
  - Ollama Integration: Gemini uses function calling to route system tasks to the local brain.
  - Self-Healing Loop: A background `while` loop automatically restarts the session on 1011 errors.
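A rough sketch of how the duplex callback, the thread-safe queues and the self-healing loop listed above could fit together. This is an illustration under assumptions, not the production engine: the block size, queue sizes, mono `int16` format, and the names `duplex_callback`, `open_duplex_stream`, `SessionDropped` and `supervise` are all hypothetical. `sounddevice` is imported lazily so the callback and restart logic can be exercised without audio hardware.

```python
import asyncio
import queue

import numpy as np

SAMPLE_RATE = 16_000  # unified rate from the feature list
BLOCK_FRAMES = 320    # assumption: 20 ms blocks at 16 kHz

mic_q = queue.Queue(maxsize=100)  # raw PCM bytes, audio thread -> network side
spk_q = queue.Queue(maxsize=100)  # 1-D int16 arrays, network side -> audio thread


def duplex_callback(indata, outdata, frames, time_info, status):
    """Runs on the audio thread, so it only uses non-blocking queue calls."""
    try:
        mic_q.put_nowait(indata.copy().tobytes())
    except queue.Full:
        pass  # drop this block rather than stall the hardware thread
    try:
        chunk = spk_q.get_nowait()
    except queue.Empty:
        outdata.fill(0)  # nothing queued: play silence
        return
    n = min(len(chunk), frames)
    outdata[:n, 0] = chunk[:n]  # samples beyond `frames` are dropped in this sketch
    outdata[n:] = 0


def open_duplex_stream():
    """Open one duplex stream for mic and speaker (requires sounddevice)."""
    import sounddevice as sd
    return sd.Stream(samplerate=SAMPLE_RATE, blocksize=BLOCK_FRAMES,
                     channels=1, dtype="int16", callback=duplex_callback)


class SessionDropped(Exception):
    """Hypothetical error a session raises when the socket closes, e.g. code 1011."""


async def supervise(session, retry_delay: float = 1.0) -> int:
    """Restart `session` whenever it drops; return the number of restarts."""
    restarts = 0
    while True:
        try:
            await session()
            return restarts  # session ended cleanly
        except SessionDropped:
            restarts += 1
            await asyncio.sleep(retry_delay)
```

In this arrangement the WebSocket side would drain `mic_q` and fill `spk_q` from inside the coroutine passed to `supervise`, while the stream returned by `open_duplex_stream()` stays open across restarts. Dropping blocks on a full queue trades occasional audio loss for a callback that never waits.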
🛠️ Technical Baseline
- Library: \`sounddevice\` + \`numpy\`
- Transport: Multimodal Live API (v1alpha WebSockets)
- Encoding: 16-bit Little Endian PCM
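For reference, a small sketch of the 16-bit little-endian PCM encoding with `numpy`. The helper names `float_to_pcm16` and `pcm16_to_float` are hypothetical, and the float range of -1.0 to 1.0 is an assumption about how samples are represented before encoding.

```python
import numpy as np


def float_to_pcm16(samples: np.ndarray) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    clipped = np.clip(samples, -1.0, 1.0)  # avoid integer wrap-around on overshoot
    return (clipped * 32767).astype("<i2").tobytes()


def pcm16_to_float(data: bytes) -> np.ndarray:
    """Convert 16-bit little-endian PCM bytes back to float32 samples."""
    return np.frombuffer(data, dtype="<i2").astype(np.float32) / 32767
```

The explicit `"<i2"` dtype pins the byte order, so the output is little-endian regardless of the host machine.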
