Brainiall Speech Live visualized — continuous audio waveform flowing into transcript bubbles with real-time partial-result indicators

Brainiall Speech Live
Voice-agent ready WebSocket transcription

Name: Brainiall Speech Live
Brand: Brainiall
SKU: streaming-stt
Availability: InStock
Rating: 4.4 (12 reviews)

Send audio chunks over WebSocket and receive partial transcripts as they are produced. Built on the Brainiall Speech (Pro tier) backbone, it supports 99 languages and is a drop-in for voice agents and live captioning. The first partial transcript arrives in roughly 1.5 s; tighter buffering is on the roadmap (Phase 2: VAD + 500 ms flush).


How it works

Connect to wss://api.brainiall.com/v1/stt/stream with Bearer authentication.
Send binary frames containing 16 kHz int16 PCM mono audio chunks (any size). Server buffers internally.
Receive JSON text frames every ~1.5 s of accumulated audio: {"type":"partial","text":"...","offset_ms":0,"duration_ms":1500}
Send text frame "__END__" to flush remaining buffer. Receive final transcript: {"type":"final","text":"...","chunks_processed":N}

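# Python client (websockets library; additional_headers needs version 14 or later)
import asyncio, json, os, wave
import websockets

async def stream(wav_path):
    # Read the key from the environment; Python does not expand "$VAR" inside strings.
    api_key = os.environ["BRAINIALL_API_KEY"]
    async with websockets.connect(
        "wss://api.brainiall.com/v1/stt/stream",
        additional_headers=[("Authorization", f"Bearer {api_key}")],
    ) as ws:
        with wave.open(wav_path, "rb") as wf:
            # The file is expected to be 16 kHz, int16, mono PCM.
            data = wf.readframes(wf.getnframes())
        # Send in 200 ms chunks, paced at real-time speed
        chunk = 16000 * 2 // 5  # bytes per chunk: int16 @ 16 kHz, 200 ms
        for i in range(0, len(data), chunk):
            await ws.send(data[i:i+chunk])
            await asyncio.sleep(0.2)
        # Ask the server to flush its buffer, then read events until the final one
        await ws.send("__END__")
        async for msg in ws:
            event = json.loads(msg)
            print(event["type"], event["text"])
            if event["type"] == "final":
                break

asyncio.run(stream("audio.wav"))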
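
The endpoint expects 16 kHz int16 mono PCM, so it can help to check a WAV file before streaming it. The following is a minimal sketch using only the standard library; check_wav_format and make_test_wav are hypothetical helper names for illustration, not part of any Brainiall SDK.

```python
import io
import wave

def check_wav_format(wav_file):
    """Return a list of problems; an empty list means the file is
    16 kHz, 16-bit, mono PCM as the streaming endpoint expects."""
    problems = []
    with wave.open(wav_file, "rb") as wf:
        if wf.getframerate() != 16000:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected 16000")
        if wf.getsampwidth() != 2:
            problems.append(f"sample width is {wf.getsampwidth()} bytes, expected 2 (int16)")
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected 1 (mono)")
    return problems

def make_test_wav(rate, channels):
    """Build a tiny in-memory WAV (0.1 s of silence) for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * channels * (rate // 10))
    buf.seek(0)
    return buf

if __name__ == "__main__":
    print(check_wav_format(make_test_wav(16000, 1)))
    print(check_wav_format(make_test_wav(44100, 2)))
```

check_wav_format accepts a path or a file-like object. Files that fail the check would need resampling or downmixing before being sent.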

Use cases

Voice agents (LLM + tool-use)

Real-time conversational agents need transcripts while the user is still speaking. The WebSocket stream emits a partial transcript for every ~1.5 s of accumulated audio, so your agent can reason in parallel while the user finishes their sentence.

Live captioning

Conferences, lectures, accessibility tools. Stream audio in, render partials on screen with low latency. 99 languages out of the box.
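
For rendering captions, one possible approach is to keep partials keyed by their offset_ms and redraw the joined text on every event. This sketch assumes each partial covers its own time segment (as the event fields in "How it works" suggest) and that the final event carries the complete transcript; CaptionBuffer is a hypothetical name, not a Brainiall class.

```python
import json

class CaptionBuffer:
    """Collects streaming events and renders the caption text in time order."""

    def __init__(self):
        self.segments = {}      # offset_ms -> partial text
        self.final_text = None  # set once the final event arrives

    def feed(self, raw_message):
        """Parse one JSON text frame and store its text."""
        event = json.loads(raw_message)
        if event["type"] == "partial":
            self.segments[event["offset_ms"]] = event["text"].strip()
        elif event["type"] == "final":
            self.final_text = event["text"].strip()

    def render(self):
        """Return the final transcript if available, else the ordered partials."""
        if self.final_text is not None:
            return self.final_text
        ordered = (self.segments[k] for k in sorted(self.segments))
        return " ".join(text for text in ordered if text)

if __name__ == "__main__":
    buf = CaptionBuffer()
    buf.feed('{"type":"partial","text":"hello","offset_ms":0,"duration_ms":1500}')
    buf.feed('{"type":"partial","text":"world","offset_ms":1500,"duration_ms":1500}')
    print(buf.render())
```

Keying by offset_ms means a partial that arrives out of order still lands in the right place on screen.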

Phone-call analytics

Real-time supervisor dashboards for call centers. Stream agent + customer audio, get partials for compliance flagging without batch wait.

Voice search / dictation

Application voice input where users want to see transcripts appear word-by-word. Similar UX to native OS dictation but through your API.

Multi-language live translation chain

Pair with our Translation API (S10): stream STT → translate partial → emit. An end-to-end live translation pipeline with a single Bearer key.

Quality & price comparison

Quality on a 0–10 scale derived from published WER benchmarks (clean-speech benchmarks, multilingual benchmarks, CallHome). Price is per audio-minute in USD.

Provider	Quality	Price/audio-min	vs market avg	Position
Deepgram Nova-3 (streaming)	9.5/10	$0.0077	61%	—
AssemblyAI Universal-Streaming	9.0/10	$0.0025	20%	—
AWS Transcribe Streaming	8.0/10	$0.024	191%	—
GCP Chirp 3 streaming	9.0/10	$0.016	127%	—
Brainiall STREAMING	8.5/10	$0.0050 (60% cheaper)	40%	Parity

Pricing rule: 90% off when inferior · 80% off at parity · 50% off when superior. Position determined by objective benchmark, refreshed quarterly. Market average excludes retired / free / no-offer entries.
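
The "vs market avg" column can be reproduced from the table. The sketch below assumes the market average is the simple mean of the four competitor prices listed; that assumption is not stated on this page, but it matches the published percentages.

```python
# Per-audio-minute prices in USD, copied from the comparison table.
competitors = {
    "Deepgram Nova-3 (streaming)": 0.0077,
    "AssemblyAI Universal-Streaming": 0.0025,
    "AWS Transcribe Streaming": 0.024,
    "GCP Chirp 3 streaming": 0.016,
}
brainiall_price = 0.0050

# Assumption: market average = unweighted mean of the competitor prices.
market_avg = sum(competitors.values()) / len(competitors)

def vs_market(price):
    """Price as a rounded percentage of the market average."""
    return round(100 * price / market_avg)

if __name__ == "__main__":
    print(f"market average: ${market_avg:.5f}/min")
    for name, price in competitors.items():
        print(f"{name}: {vs_market(price)}%")
    print(f"Brainiall STREAMING: {vs_market(brainiall_price)}%")
```

Under that assumption the average is about $0.01255 per minute, and $0.0050 is about 40% of it, which is where the "60% cheaper" label in the table comes from.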

Where we're honest

First-partial latency: ~1.5 s today (buffer-by-time MVP). Deepgram Nova-3 streaming is closer to ~300 ms first-partial. We're working on VAD-based segmentation + 500 ms flush in Phase 2.
Sliding window: Each partial is transcribed in isolation today (no overlap). The transcription engine occasionally hallucinates on short isolated segments. Phase 2 adds 500 ms overlap to fix this.
Speaker diarization: Brainiall Speaker ID is not available in streaming mode yet (batch only). Our existing batch /transcribe endpoint includes diarization by the Brainiall Speaker ID engine.
Languages: Auto-language detection (good for 99 languages). For tighter latency, a future language= parameter will skip detection.
Batch transcription latency (CPU): the same Brainiall Speech (Pro tier) engine also serves the non-streaming /v1/whisper/transcribe batch endpoint. On the current CPU fleet, a warm 4-second clip transcribes in ~6–7 s, and the cost grows roughly linearly with audio length. For audio > 60 s we recommend chunking on the client (e.g. 30 s segments) and stitching the transcripts together. That keeps individual request latency bounded and parallelises cleanly.
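
The client-side chunking recommended for long batch audio can be sketched as follows. This is an illustration under simple assumptions: the input is raw 16 kHz int16 mono PCM, segments are cut at fixed 30 s boundaries (so a word may be split at a cut), and split_pcm and stitch are hypothetical helper names. The actual upload of each segment to the batch endpoint is left out.

```python
def split_pcm(data, segment_s=30, sample_rate=16000, sample_width=2):
    """Split raw mono PCM bytes into segments of at most segment_s seconds."""
    seg_bytes = segment_s * sample_rate * sample_width
    return [data[i:i + seg_bytes] for i in range(0, len(data), seg_bytes)]

def stitch(transcripts):
    """Join per-segment transcripts in order, skipping empty results."""
    return " ".join(t.strip() for t in transcripts if t.strip())

if __name__ == "__main__":
    # 70 s of silence at 16 kHz int16 mono
    audio = b"\x00\x00" * 16000 * 70
    segments = split_pcm(audio)
    print([len(s) // (16000 * 2) for s in segments])
    print(stitch(["first part", "", " second part "]))
```

Each segment can be sent as an independent request, so the segments can be transcribed in parallel and joined in their original order.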

Pricing

Discount derived from quality position. 80% off at parity with the industry leaders on coverage and accuracy.

Free

$0/mo

60 minutes/month · Brainiall Speech (Pro tier) · Forever free

Starter

$19/mo

1,200 minutes/month · 99 languages · Standard latency

Pro

$99/mo

10,000 minutes/month · Priority queue · 99.5% SLA

Business

$299/mo

50,000 minutes/month · Dedicated capacity · Email + Slack

Streaming STT, transparent latency, about two-thirds the Deepgram price

