Audio Mode

Audio mode is for teams that want raw audio, not text turn-taking.

You use this when your agent stack already speaks WebSocket and you care about voice-native control.

Audio mode costs $0.13/min for Zone A (US/Canada), same as webhook mode. You pay for telephony only. International destinations use Zone B (×2) and Zone C (×3) — see Voice zones.

When to use it

  • OpenAI Realtime
  • custom ASR or TTS
  • interruption handling under your control
  • low-latency voice orchestration

Create an audio line

curl -X POST https://saperly.com/api/v1/lines \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Realtime voice line",
    "mode": "audio",
    "audio_handler_url": "wss://your-app.com/voice",
    "status_callback_url": "https://your-app.com/status"
  }'

Connection flow

  1. an inbound or outbound call starts
  2. Saperly gives your system a relay URL
  3. your system connects over WebSocket
  4. audio frames move in both directions

Relay messages you receive

{
  "type": "call_started",
  "call_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}

{
  "type": "audio",
  "payload": "<base64-encoded-audio>",
  "timestamp": "1711900005000"
}

{
  "type": "call_ended",
  "call_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "duration_sec": 45
}

Messages you send back

{
  "type": "audio",
  "payload": "<base64-encoded-audio>"
}
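
Outbound audio is easiest to pace if you split it into the per-format frame sizes listed under Audio formats below. A minimal sketch, assuming pcm16_16k (640-byte frames at 20 ms); the frames helper name is hypothetical:

function* frames(buf: Buffer, frameBytes = 640): Generator<Buffer> {
  // 640 bytes = 20 ms of pcm16_16k; adjust per the format table below
  for (let off = 0; off + frameBytes <= buf.length; off += frameBytes) {
    yield buf.subarray(off, off + frameBytes);
  }
}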

Audio formats

Saperly can bridge between carrier audio and the format your agent stack wants. Use a format query parameter on the relay URL to select the codec.

Format      Sample rate   Bit depth   Frame size (20 ms)   Use case
mulaw_8k    8 kHz         8-bit       160 bytes            Carrier native, lowest bandwidth
pcm16_8k    8 kHz         16-bit      320 bytes            Basic ASR
pcm16_16k   16 kHz        16-bit      640 bytes            Most modern ASR/TTS (recommended)
pcm16_24k   24 kHz        16-bit      960 bytes            High-quality TTS, OpenAI Realtime

pcm16_16k is usually the sane default when connecting to modern realtime models. Use pcm16_24k when your downstream model expects it (OpenAI Realtime), and mulaw_8k when you want to stay closest to the carrier stream.
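
Selecting a codec is just a query parameter on the relay URL. A minimal sketch using the standard URL API (connectWithFormat is a hypothetical helper; relayUrl is the relay URL Saperly hands your system):

import WebSocket from 'ws';

function connectWithFormat(relayUrl: string, format = 'pcm16_16k'): WebSocket {
  const url = new URL(relayUrl);
  url.searchParams.set('format', format); // pick a codec from the table above
  return new WebSocket(url.toString());
}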

TypeScript WebSocket client

A minimal handler that decodes inbound frames, hands them to your pipeline, and exposes a sendAudio helper for writing audio back.

import WebSocket from 'ws';

// Wire this up to your ASR/model pipeline.
declare function processAudio(audio: Buffer): void;

function handleAudioRelay(relayUrl: string) {
  const ws = new WebSocket(relayUrl);

  ws.on('message', (data) => {
    const msg = JSON.parse(data.toString());

    switch (msg.type) {
      case 'call_started':
        console.log(`Call ${msg.call_id} started`);
        break;
      case 'audio': {
        // msg.payload is base64-encoded audio
        const audioBuffer = Buffer.from(msg.payload, 'base64');
        // Feed to your ASR/model pipeline
        processAudio(audioBuffer);
        break;
      }
      case 'call_ended':
        console.log(`Call ended after ${msg.duration_sec}s`);
        ws.close();
        break;
    }
  });

  // Send audio back through the relay
  function sendAudio(audioBuffer: Buffer) {
    ws.send(JSON.stringify({
      type: 'audio',
      payload: audioBuffer.toString('base64'),
    }));
  }

  return { sendAudio };
}

Rate limiting and backpressure

Audio frames arrive at a fixed cadence set by the frame size: with 20 ms frames, that is 50 frames/sec regardless of format. If your handler cannot keep up, frames are dropped. Monitor your WebSocket buffer size and processing latency.

Practical guidance:

  • Keep per-frame processing under the frame interval (20ms). Offload heavy work to a queue.
  • Watch ws.bufferedAmount on your outbound socket (see the sketch after this list). If it climbs, your sender is producing faster than the network can drain.
  • Prefer small frames and steady cadence over large chunks sent in bursts.
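
A minimal backpressure-aware sender sketch; the 64 KB threshold is an illustrative choice, not a documented Saperly limit:

import WebSocket from 'ws';

const MAX_BUFFERED = 64 * 1024; // illustrative threshold, tune for your network

function trySendFrame(ws: WebSocket, frame: Buffer): boolean {
  // bufferedAmount is bytes queued locally but not yet flushed to the socket
  if (ws.bufferedAmount > MAX_BUFFERED) {
    return false; // drop (or queue) the frame rather than letting latency build
  }
  ws.send(JSON.stringify({ type: 'audio', payload: frame.toString('base64') }));
  return true;
}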

Integration: OpenAI Realtime API

  1. Create an audio line with pcm16_24k: set audio_handler_url to your WebSocket server and use ?format=pcm16_24k on the relay URL.
  2. On call_started, open a session to OpenAI Realtime: forward audio frames from Saperly to the OpenAI Realtime session.
  3. Forward OpenAI audio responses back to Saperly: send base64-encoded audio frames back through the relay.
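
A minimal bridge sketch following those steps. The Realtime endpoint, headers, and event names (input_audio_buffer.append, response.audio.delta) reflect OpenAI's Realtime API at the time of writing; verify the current model name and event shapes against OpenAI's docs.

import WebSocket from 'ws';

function bridgeCall(relayUrl: string, openaiApiKey: string) {
  // Saperly relay for this call; relayUrl should carry ?format=pcm16_24k
  const relay = new WebSocket(relayUrl);
  // OpenAI Realtime session
  const oai = new WebSocket(
    'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
    {
      headers: {
        Authorization: `Bearer ${openaiApiKey}`,
        'OpenAI-Beta': 'realtime=v1',
      },
    },
  );

  // Saperly -> OpenAI: forward caller audio into the input buffer
  relay.on('message', (data) => {
    const msg = JSON.parse(data.toString());
    if (msg.type === 'audio' && oai.readyState === WebSocket.OPEN) {
      oai.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: msg.payload }));
    } else if (msg.type === 'call_ended') {
      oai.close();
    }
  });

  // OpenAI -> Saperly: forward model audio deltas back to the caller
  oai.on('message', (data) => {
    const evt = JSON.parse(data.toString());
    if (evt.type === 'response.audio.delta' && relay.readyState === WebSocket.OPEN) {
      relay.send(JSON.stringify({ type: 'audio', payload: evt.delta }));
    }
  });
}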

Troubleshooting

Symptom             Likely cause             Fix
No audio received   Wrong relay URL format   Check the ?format= parameter
Garbled audio       Codec mismatch           Ensure sender and receiver use the same format
High latency        Processing bottleneck    Profile your audio pipeline; check buffer sizes
Dropped frames      Backpressure             Increase consumer throughput or buffering
Connection drops    Timeout                  Send keepalive pings every 30 seconds
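
For the last row, a keepalive sketch using the ws library's protocol-level ping frames; the 30-second interval matches the guidance above:

import WebSocket from 'ws';

function startKeepalive(ws: WebSocket): void {
  // Ping every 30 s so idle connections are not closed by intermediaries
  const timer = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) ws.ping();
  }, 30_000);
  ws.on('close', () => clearInterval(timer));
}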

Build advice

Do not start here unless you already know why webhook mode is insufficient.

Audio mode is powerful, but it has more failure surfaces:

  • dropped frames
  • timing drift
  • backpressure
  • interruption logic
  • codec mismatches

If your product does not need that level of control, start with hosted mode or webhook mode instead.