Audio Mode
Audio mode is for teams that want raw audio, not text turn-taking.
You use this when your agent stack already speaks WebSocket and you care about voice-native control.
When to use it
- OpenAI Realtime
- custom ASR or TTS
- interruption handling under your control
- low-latency voice orchestration
Create an audio line
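A sketch of creating a line, assuming a JSON request body. The only field name taken from this page is audio_handler_url; the mode field, the endpoint path, and the response shape are assumptions, so check them against the API reference before relying on this.

```typescript
// Hypothetical request body for creating an audio line.
// `audio_handler_url` is documented; `mode` is an assumption.
interface CreateAudioLineRequest {
  mode: "audio";
  audio_handler_url: string; // your WebSocket server
}

function buildCreateLineRequest(handlerUrl: string): CreateAudioLineRequest {
  return { mode: "audio", audio_handler_url: handlerUrl };
}

// Sending it would look roughly like this (endpoint URL is illustrative):
// await fetch("https://api.saperly.example/v1/lines", {
//   method: "POST",
//   headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
//   body: JSON.stringify(buildCreateLineRequest("wss://agent.example.com/audio")),
// });
```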
Connection flow
- an inbound or outbound call starts
- Saperly gives your system a relay URL
- your system connects over WebSocket
- audio frames move in both directions
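The connect step can be sketched as follows. The format query parameter is documented below under Audio formats; the relay host in the comment is a placeholder, not a real endpoint.

```typescript
// Attach the codec to the relay URL via the `format` query parameter,
// then open the WebSocket. Binary framing is an assumption here.
function withFormat(
  relayUrl: string,
  format: "pcm16_16k" | "pcm16_24k" | "mulaw_8k",
): string {
  const u = new URL(relayUrl);
  u.searchParams.set("format", format);
  return u.toString();
}

// const ws = new WebSocket(withFormat(relayUrl, "pcm16_16k"));
// ws.binaryType = "arraybuffer"; // receive frames as ArrayBuffer, not Blob
```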
Relay messages you receive
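The exact message schema is not reproduced here; the sketch below assumes a JSON envelope with a type field plus base64 audio payloads, a common shape for audio relays. Every type name and field below is an assumption to verify against the relay reference.

```typescript
// Assumed inbound envelope: control events plus base64 audio frames.
type RelayMessage =
  | { type: "call.started"; call_id: string }
  | { type: "audio"; payload: string } // base64-encoded frame
  | { type: "call.ended"; call_id: string };

// Parse a text frame; non-JSON input (e.g. raw binary handled elsewhere)
// yields null rather than throwing.
function parseRelayMessage(raw: string): RelayMessage | null {
  try {
    const msg = JSON.parse(raw);
    return typeof msg?.type === "string" ? (msg as RelayMessage) : null;
  } catch {
    return null;
  }
}
```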
Messages you send back
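Outbound shapes are likewise assumptions: an audio frame to play toward the caller, and a clear message to drop queued audio when handling an interruption. Confirm both names against the relay reference.

```typescript
// Assumed outbound messages. `clear` stands in for whatever the relay
// uses to flush queued playback on barge-in.
function audioOut(frame: Uint8Array): string {
  return JSON.stringify({
    type: "audio",
    payload: Buffer.from(frame).toString("base64"),
  });
}

function clearPlayback(): string {
  return JSON.stringify({ type: "clear" });
}
```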
Audio formats
Saperly can bridge between carrier audio and the format your agent stack wants. Use a format query parameter on the relay URL to select the codec.
pcm16_16k is usually the sane default when connecting to modern realtime models. Use pcm16_24k when your downstream model expects it (OpenAI Realtime), and mulaw_8k when you want to stay closest to the carrier stream.
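The formats imply different per-frame sizes, which matters for the backpressure math later. Sample rates and widths follow from the format names; the 20 ms frame duration is a typical choice for illustration, not a guarantee of what the relay uses.

```typescript
// Bytes per frame = sampleRate * (frameMs / 1000) * bytesPerSample.
const FORMATS = {
  pcm16_16k: { sampleRate: 16000, bytesPerSample: 2 },
  pcm16_24k: { sampleRate: 24000, bytesPerSample: 2 },
  mulaw_8k: { sampleRate: 8000, bytesPerSample: 1 },
} as const;

function frameBytes(format: keyof typeof FORMATS, frameMs = 20): number {
  const { sampleRate, bytesPerSample } = FORMATS[format];
  return (sampleRate * frameMs / 1000) * bytesPerSample;
}
// frameBytes("pcm16_16k") → 640, frameBytes("pcm16_24k") → 960, frameBytes("mulaw_8k") → 160
```

So a mulaw_8k stream moves a quarter of the bytes of pcm16_16k, at the cost of fidelity.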
TypeScript WebSocket client
A minimal handler that decodes inbound frames, processes them, and writes audio back.
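A sketch of that handler, assuming the relay delivers raw pcm16 frames as binary WebSocket messages. The "processing" is a trivial gain stage standing in for your agent pipeline, and the socket interface is narrowed to the members used so the sketch compiles without DOM typings.

```typescript
// Minimal socket surface; real code would pass a WebSocket instance.
interface RelaySocket {
  binaryType: string;
  onmessage: ((ev: { data: ArrayBuffer | string }) => void) | null;
  send(data: ArrayBufferLike): void;
}

// Placeholder processing: apply gain and clamp to the int16 range.
function processFrame(frame: Int16Array, gain = 1.0): Int16Array {
  const out = new Int16Array(frame.length);
  for (let i = 0; i < frame.length; i++) {
    const v = Math.round(frame[i] * gain);
    out[i] = Math.max(-32768, Math.min(32767, v));
  }
  return out;
}

function attachAudioHandler(ws: RelaySocket): void {
  ws.binaryType = "arraybuffer";
  ws.onmessage = (ev) => {
    if (!(ev.data instanceof ArrayBuffer)) return; // text control messages handled elsewhere
    const inbound = new Int16Array(ev.data);
    const outbound = processFrame(inbound);
    ws.send(outbound.buffer); // write processed audio back to the caller
  };
}
```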
Rate limiting and backpressure
Practical guidance:
- Keep per-frame processing under the frame interval (20ms). Offload heavy work to a queue.
- Watch ws.bufferedAmount on your outbound socket. If it climbs, your sender is faster than the network.
- Prefer small frames and steady cadence over large chunks sent in bursts.
Integration: OpenAI Realtime API
Create an audio line with pcm16_24k
Set audio_handler_url to your WebSocket server and use ?format=pcm16_24k on the relay URL.
Troubleshooting
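A frequent symptom of a codec mismatch is frames whose byte length does not divide evenly by the expected frame size. This check is a debugging aid, not part of the Saperly API; it assumes 20 ms frames, so adjust the expected sizes if your framing differs.

```typescript
// Expected bytes per 20 ms frame for each supported format.
const EXPECTED_FRAME_BYTES = {
  pcm16_16k: 640,
  pcm16_24k: 960,
  mulaw_8k: 160,
} as const;

// True when an inbound frame's length is inconsistent with the
// negotiated format — a hint that the wrong codec is in play.
function looksLikeCodecMismatch(
  frameByteLength: number,
  format: keyof typeof EXPECTED_FRAME_BYTES,
): boolean {
  return frameByteLength % EXPECTED_FRAME_BYTES[format] !== 0;
}
```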
Build advice
Do not start here unless you already know why webhook mode is insufficient.
Audio mode is powerful, but it has more failure surfaces:
- dropped frames
- timing drift
- backpressure
- interruption logic
- codec mismatches
If your product does not need that level of control, start with hosted mode or webhook mode instead.
