Streaming STT API
Voice-agent-ready WebSocket transcription
Send audio chunks over a WebSocket and receive partial transcripts as they are produced. Backed by Brainiall Speech (Pro tier) with 99 languages, it is a drop-in for voice agents and live captioning. The first partial transcript arrives after roughly 1.5 s; tighter buffering is on the roadmap (Phase 2: VAD + 500 ms flush).
How it works
- Connect to wss://api.brainiall.com/v1/stt/stream with Bearer authentication.
- Send binary frames containing 16 kHz int16 PCM mono audio chunks (any size). The server buffers internally.
- Receive a JSON text frame for every ~1.5 s of accumulated audio:
  {"type":"partial","text":"...","offset_ms":0,"duration_ms":1500}
- Send the text frame "__END__" to flush the remaining buffer, then receive the final transcript:
  {"type":"final","text":"...","chunks_processed":N}
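The framing and event formats above can be checked offline before touching the network. This is a minimal sketch that assumes only the message shapes listed; chunk_pcm, parse_event, and SttEvent are hypothetical helper names, not part of any Brainiall SDK. It splits raw 16 kHz int16 mono PCM into fixed-duration binary frames and decodes the JSON text frames the server sends back.

```python
import json
from dataclasses import dataclass
from typing import Iterator, Optional

SAMPLE_RATE = 16_000  # Hz, as required by the stream endpoint
BYTES_PER_SAMPLE = 2  # int16, mono


def chunk_pcm(pcm: bytes, chunk_ms: int = 200) -> Iterator[bytes]:
    """Yield consecutive slices of raw PCM covering chunk_ms each (the last may be shorter)."""
    size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for i in range(0, len(pcm), size):
        yield pcm[i:i + size]


@dataclass
class SttEvent:
    type: str                               # "partial" or "final"
    text: str
    offset_ms: Optional[int] = None         # present on partial events
    duration_ms: Optional[int] = None       # present on partial events
    chunks_processed: Optional[int] = None  # present on the final event


def parse_event(message: str) -> SttEvent:
    """Decode one JSON text frame from the server into an SttEvent."""
    raw = json.loads(message)
    if raw.get("type") not in ("partial", "final"):
        raise ValueError(f"unexpected event type: {raw.get('type')!r}")
    return SttEvent(
        type=raw["type"],
        text=raw.get("text", ""),
        offset_ms=raw.get("offset_ms"),
        duration_ms=raw.get("duration_ms"),
        chunks_processed=raw.get("chunks_processed"),
    )
```

For example, one second of silence (32,000 bytes) splits into five 200 ms frames of 6,400 bytes each.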
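```python
# Python client (websockets lib, v14+; older releases call the header argument extra_headers)
import asyncio
import json
import os
import wave

import websockets


async def stream(wav_path):
    async with websockets.connect(
        "wss://api.brainiall.com/v1/stt/stream",
        # Read the key from the environment; "$BRAINIALL_API_KEY" inside a Python string is not expanded
        additional_headers=[("Authorization", f"Bearer {os.environ['BRAINIALL_API_KEY']}")],
    ) as ws:
        with wave.open(wav_path, "rb") as wf:
            # The endpoint expects 16 kHz, 16-bit (int16), mono PCM
            if (wf.getframerate(), wf.getsampwidth(), wf.getnchannels()) != (16000, 2, 1):
                raise ValueError("expected a 16 kHz, 16-bit, mono WAV file")
            data = wf.readframes(wf.getnframes())
        # Send in 200 ms chunks
        chunk = 16000 * 2 // 5  # 16,000 samples/s * 2 bytes/sample / 5 = 6,400 bytes = 200 ms
        for i in range(0, len(data), chunk):
            await ws.send(data[i:i + chunk])
            await asyncio.sleep(0.2)  # pace the upload at roughly real time
        await ws.send("__END__")  # flush the server-side buffer
        async for msg in ws:
            event = json.loads(msg)
            print(event["type"], event["text"])
            if event["type"] == "final":
                break


asyncio.run(stream("audio.wav"))
```

Use cases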
Voice agents (LLM + tool-use)
Real-time conversational agents need transcripts while the user is still speaking. The WebSocket emits a partial transcript for roughly every 1.5 s of audio, so your agent can start reasoning before the user finishes the sentence.
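The quickstart client above uploads all audio before reading any messages, so partials only reach the caller after the upload ends. An agent needs them while the user is still talking, which means sending and receiving concurrently. This is a minimal sketch under that assumption; run_session and on_partial are hypothetical names, and ws is any object with an async send() that can be async-iterated for incoming messages (a connection from websockets.connect() fits).

```python
import asyncio
import json
from typing import Awaitable, Callable, Iterable


async def run_session(
    ws,
    frames: Iterable[bytes],
    on_partial: Callable[[str], Awaitable[None]],
    frame_seconds: float = 0.2,
) -> str:
    """Upload audio and consume transcripts at the same time; return the final text."""

    async def sender() -> None:
        for frame in frames:
            await ws.send(frame)
            await asyncio.sleep(frame_seconds)  # pace roughly in real time
        await ws.send("__END__")  # ask the server to flush and finalize

    async def receiver() -> str:
        async for msg in ws:
            event = json.loads(msg)
            if event["type"] == "partial":
                await on_partial(event["text"])  # e.g. hand the text to the agent
            elif event["type"] == "final":
                return event["text"]
        raise ConnectionError("stream closed before the final transcript arrived")

    # Run both directions concurrently so partials are handled while audio still flows.
    _, final_text = await asyncio.gather(sender(), receiver())
    return final_text
```

Against the real endpoint, pass the ws from the async with websockets.connect(...) block and an iterable of 200 ms PCM frames; on_partial is where the agent would begin planning.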
Live captioning
Conferences, lectures, accessibility tools. Stream audio in, render partials on screen with low latency. 99 languages out of the box.
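Each partial event carries offset_ms and duration_ms, which map directly onto caption cue timings. The sketch below renders partials as WebVTT, a common caption format; it assumes offset_ms is measured from the start of the stream, and the function names are hypothetical.

```python
def ms_to_vtt(ms: int) -> str:
    """Format a millisecond offset as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def partial_to_cue(event: dict) -> str:
    """Turn one partial event into a WebVTT cue (timing line plus text)."""
    start = event["offset_ms"]
    end = start + event["duration_ms"]
    return f"{ms_to_vtt(start)} --> {ms_to_vtt(end)}\n{event['text'].strip()}\n"


def build_vtt(events: list) -> str:
    """Assemble a WebVTT document from partial events, skipping empty ones."""
    cues = [
        partial_to_cue(e)
        for e in events
        if e.get("type") == "partial" and e.get("text", "").strip()
    ]
    return "WEBVTT\n\n" + "\n".join(cues)
```

A partial with offset_ms 1500 and duration_ms 1500 becomes the cue 00:00:01.500 --> 00:00:03.000.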
Phone-call analytics
Real-time supervisor dashboards for call centers. Stream agent and customer audio and get partials for compliance flagging without waiting for a batch job.
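Because each partial is currently transcribed as a separate segment (see "Where we're honest"), a phrase such as "cancel my account" can be split across two partials. A minimal sketch of phrase flagging that carries a few trailing words forward to catch such splits; PhraseFlagger is a hypothetical name, and plain word matching stands in for whatever a real compliance engine would use.

```python
import re
from typing import Iterable, List

_WORD = re.compile(r"[a-z0-9']+")


def _words(text: str) -> List[str]:
    return _WORD.findall(text.lower())


class PhraseFlagger:
    def __init__(self, phrases: Iterable[str]):
        self.phrases = [" ".join(_words(p)) for p in phrases]
        # Keep just enough trailing words to complete the longest phrase.
        self.carry = max((len(p.split()) for p in self.phrases), default=1) - 1
        self._tail: List[str] = []

    def feed(self, partial_text: str) -> List[str]:
        """Return phrases that newly appear, including ones split across partials."""
        new = _words(partial_text)
        tail_text = f" {' '.join(self._tail)} "
        new_text = f" {' '.join(new)} "
        window = f" {' '.join(self._tail + new)} "
        hits = [
            p for p in self.phrases
            # Either fully inside the new partial, or completed by it (not already seen in the tail).
            if f" {p} " in new_text or (f" {p} " in window and f" {p} " not in tail_text)
        ]
        self._tail = (self._tail + new)[-self.carry:] if self.carry > 0 else []
        return hits
```

Feeding "I want to cancel" and then "my account please" flags "cancel my account" on the second partial.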
Voice search / dictation
Application voice input where users want to see transcripts appear word-by-word. Similar UX to native OS dictation but through your API.
Multi-language live translation chain
Pair with our Translation API (S10): stream STT → translate each partial → emit. The whole live translation pipeline runs on a single Bearer key.
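A sketch of the translate stage in such a chain, with the STT receive loop putting each partial's text on a queue and None when the stream ends. The translate and emit callables are placeholders; the Translation API's actual interface is not described on this page, so nothing here reflects its real signature.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Translate = Callable[[str], Awaitable[str]]  # placeholder for a Translation API call
Emit = Callable[[str], Awaitable[None]]      # e.g. push to a caption overlay


async def translation_stage(
    partials: "asyncio.Queue[Optional[str]]", translate: Translate, emit: Emit
) -> None:
    """Translate partial transcripts in arrival order and emit each result."""
    while True:
        text = await partials.get()
        if text is None:  # end-of-stream sentinel
            break
        if text.strip():  # skip empty partials (e.g. silence)
            await emit(await translate(text))
```

Using an unbounded queue keeps translation latency from holding up the socket reader, and handling one partial at a time keeps the emitted output in order.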
Quality & price comparison
Quality on a 0–10 scale derived from published WER benchmarks (clean-speech benchmarks, multilingual benchmarks, CallHome). Price is per audio-minute in USD.
| Provider | Quality (0–10) | Price per audio-minute (USD) | Price vs market avg | Position |
|---|---|---|---|---|
| Deepgram Nova-3 (streaming) | 9.5/10 | $0.0077 | 61% | — |
| AssemblyAI Universal-Streaming | 9.0/10 | $0.0025 | 20% | — |
| AWS Transcribe Streaming | 8.0/10 | $0.024 | 191% | — |
| GCP Chirp 3 streaming | 9.0/10 | $0.016 | 127% | — |
| Brainiall STREAMING | 8.5/10 | $0.0050 (60% below market avg) | 40% | Parity |
Pricing rule: 90% off when inferior · 80% off at parity · 50% off when superior. Position determined by objective benchmark, refreshed quarterly.
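The "vs market avg" column is consistent with a market average equal to the plain mean of the four competitor prices: ($0.0077 + $0.0025 + $0.024 + $0.016) / 4 = $0.01255 per minute. That is our reading of the numbers rather than a methodology the page states; the check below reproduces the column under that assumption.

```python
# Competitor list prices per audio-minute (USD), copied from the table.
COMPETITORS = {
    "Deepgram Nova-3 (streaming)": 0.0077,
    "AssemblyAI Universal-Streaming": 0.0025,
    "AWS Transcribe Streaming": 0.024,
    "GCP Chirp 3 streaming": 0.016,
}
BRAINIALL = 0.0050

# Assumed definition: mean of the four competitor prices.
market_avg = sum(COMPETITORS.values()) / len(COMPETITORS)


def vs_market(price: float) -> int:
    """Price as a whole-number percentage of the market average."""
    return round(100 * price / market_avg)


for name, price in COMPETITORS.items():
    print(f"{name}: {vs_market(price)}%")
print(f"Brainiall: {vs_market(BRAINIALL)}% ({100 - vs_market(BRAINIALL)}% below market avg)")
```

This yields 61%, 20%, 191%, and 127% for the competitors and 40% for Brainiall, matching the table.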
Where we're honest
- First-partial latency: ~1.5 s today (buffer-by-time MVP). Deepgram Nova-3 streaming delivers its first partial in closer to ~300 ms. We're working on Silero VAD-based segmentation + 500 ms flush in Phase 2.
- Sliding window: Each partial is transcribed in isolation today (no overlap). The transcription engine occasionally hallucinates on short isolated segments. Phase 2 adds 500 ms overlap to fix this.
- Speaker diarization: Not yet available in streaming mode (batch only). Our existing batch /transcribe endpoint includes diarization from the Brainiall Speaker ID engine.
- Languages: Automatic language detection across all 99 languages. For tighter latency, a future language= parameter will skip detection.
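To illustrate the general overlap idea behind the Phase 2 item (this is not Brainiall's implementation): with 1.5 s windows and 500 ms of overlap the step is 1 s, and a word spoken at a window boundary shows up in both transcripts, so merging must drop the duplicated words. The function names are hypothetical, and the merge is a naive exact word match chosen for brevity.

```python
from typing import List, Tuple


def overlapping_windows(
    total_ms: int, window_ms: int = 1500, overlap_ms: int = 500
) -> List[Tuple[int, int]]:
    """Split [0, total_ms) into windows that each repeat the last overlap_ms of the previous one."""
    if not 0 <= overlap_ms < window_ms:
        raise ValueError("overlap must be non-negative and shorter than the window")
    if total_ms <= 0:
        return []
    step = window_ms - overlap_ms
    windows, start = [], 0
    while True:
        end = min(start + window_ms, total_ms)
        windows.append((start, end))
        if end >= total_ms:
            return windows
        start += step


def merge_words(prev: List[str], nxt: List[str], max_overlap: int = 5) -> List[str]:
    """Append nxt to prev, dropping leading words of nxt that repeat prev's tail."""
    for k in range(min(max_overlap, len(prev), len(nxt)), 0, -1):
        if [w.lower() for w in prev[-k:]] == [w.lower() for w in nxt[:k]]:
            return prev + nxt[k:]
    return prev + nxt
```

One known weakness of exact matching: a genuinely repeated word at a boundary ("that that") would be collapsed into one.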
Pricing
Discount derived from quality position. 80% off at parity with the industry leaders on coverage and accuracy.
Free
$0/mo
60 minutes/month · Brainiall Speech (Pro tier) · Forever free
Starter
$19/mo
1,200 minutes/month · 99 languages · Standard latency
Pro
$99/mo
10,000 minutes/month · Priority queue · 99.5% SLA
Business
$299/mo
50,000 minutes/month · Dedicated capacity · Email + Slack
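To compare plans, it helps to divide each monthly price by its included minutes. A small worked sketch using only the figures above; cheapest_tier and rate_per_included_minute are hypothetical helpers, and because this page lists no overage pricing, usage above 50,000 minutes simply returns None.

```python
from typing import Optional

# (tier, monthly price in USD, included minutes per month), from the pricing list
TIERS = [
    ("Free", 0, 60),
    ("Starter", 19, 1_200),
    ("Pro", 99, 10_000),
    ("Business", 299, 50_000),
]


def cheapest_tier(minutes_per_month: int) -> Optional[str]:
    """Return the lowest-priced tier whose included minutes cover the usage, or None."""
    for name, _price, included in TIERS:  # listed in ascending price order
        if minutes_per_month <= included:
            return name
    return None  # above 50,000 min/month; overage terms are not listed


def rate_per_included_minute(name: str) -> float:
    """Monthly price divided by included minutes, i.e. the rate if the quota is fully used."""
    for tier, price, included in TIERS:
        if tier == name:
            return price / included
    raise KeyError(name)


for tier, _, _ in TIERS:
    print(f"{tier}: ${rate_per_included_minute(tier):.4f} per included minute")
```

At full use, Starter works out to about $0.0158 per minute, Pro to $0.0099, and Business to about $0.0060.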