Pronunciation Assessment API
Phone-level scores that exceed human agreement
The only commercial pronunciation API where the model is more consistent than trained human raters. Phone PCC 0.590 on our production Light tier and 0.682 on the Premium fine-tuned tier — both above the 0.555 inter-annotator ceiling on our public test set. AWS and GCP have no equivalent product; Azure offers 33 locales without publishing accuracy numbers.
Why this scoring matters
Pronunciation Correlation Coefficient (PCC) measures how closely automatic scores match the consensus of trained human raters. On our public test set, two trained humans agree at PCC 0.555 (phone-level). Our production Light model scores 0.590 and our Premium tier 0.682. Both exceed the ceiling that trained humans achieve with each other, meaning the API is, on average, more consistent than two graders comparing notes.
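Concretely, PCC is the Pearson correlation between the model's scores and the human consensus over the same phones. A minimal sketch of that computation (the function name and toy data are illustrative, not our evaluation harness):

import numpy as np

def pcc(model_scores, human_consensus):
    """Pearson correlation between automatic and human phone scores."""
    x = np.asarray(model_scores, dtype=float)
    y = np.asarray(human_consensus, dtype=float)
    # Pearson r = cov(x, y) / (std(x) * std(y)), via the correlation matrix
    return float(np.corrcoef(x, y)[0, 1])

# Toy example: phone-level scores for one utterance
model = [94, 78, 85, 62, 90]
humans = [92, 80, 88, 60, 87]   # mean of the trained raters' labels
print(f"phone PCC: {pcc(model, humans):.3f}")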
Phone PCC (Light)
0.590
+3.5pp above human (0.555). 9-model ensemble, runs on CPU.
Phone PCC (Premium)
0.682
+12.7pp above human, +2.5pp above SOTA (HIA 0.657). LoRA fine-tune on WavLM-large.
Sentence PCC
0.711
+3.6pp above human (0.675). Sentence-level via XGBoost stacked over phone+word features; see the sketch below the stats.
Latency p95
423ms
Light tier on shared GPU; Premium ~530ms warm. Both sync — no batch wait.
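A minimal sketch of the sentence-level stacking mentioned above: summarize variable-length phone and word scores into a fixed feature vector, then regress the sentence score with XGBoost. The feature choices and toy training data here are assumptions for illustration, not our production pipeline.

import numpy as np
import xgboost as xgb

def stack_features(phone_scores, word_scores):
    """Aggregate variable-length phone/word scores into a fixed feature vector."""
    p = np.asarray(phone_scores, dtype=float)
    w = np.asarray(word_scores, dtype=float)
    return np.array([
        p.mean(), p.std(), p.min(),      # phone-level summary
        np.percentile(p, 10),            # weight the worst phones
        w.mean(), w.std(), w.min(),      # word-level summary
    ])

# Illustrative training set: one feature row per utterance, human sentence scores
X = np.stack([stack_features([94, 78, 85], [92, 78]),
              stack_features([55, 60, 48], [52, 58])])
y = np.array([86.0, 54.0])

model = xgb.XGBRegressor(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.predict(X))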
Use cases
Edtech & language learning
L2 English speakers practicing on mobile or web. Get phone-level color-coded scores so the learner sees exactly which sounds need work — not just a sentence-level grade.
Call-center accent training
Score agent recordings nightly to track pronunciation drift across cohorts. The phone-level granularity surfaces specific phonemes (/θ/, /ð/, vowel reductions) that managerial review can't catch in real time.
Accessibility & speech therapy
Speech-language pathologists need objective per-phoneme scores for clinical baselines. Our phone-level PCC exceeds inter-annotator agreement among trained human raters, so the automatic score is more consistent than a pair of human graders.
Voice-acting / dubbing QA
Validate that dubbed lines meet target pronunciation before mastering. Catches mid-session drift in long dubbing sessions before re-records become expensive.
Voice-search / IVR fine-tuning
When an STT model misrecognizes accented speech, you usually don't know if it's the model or the speaker. Pronunciation scoring on the audio answers that — separating user-side issues from model-side issues.
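A sketch of that triage logic, assuming a hypothetical assess() wrapper around the endpoint and an 80-point threshold (both are illustrative choices, not part of the API):

def triage(audio_b64, expected_text, stt_hypothesis, assess):
    """Attribute an STT miss to the speaker or to the STT model.

    `assess` is a hypothetical client for /v1/pronunciation/assess
    returning the JSON shown in the integration example below.
    """
    result = assess(audio=audio_b64, text=expected_text, tier="light")
    well_pronounced = result["sentenceScore"] >= 80   # threshold is an assumption
    recognized = stt_hypothesis.strip().lower() == expected_text.strip().lower()

    if recognized:
        return "ok"
    # Clear speech that STT still gets wrong points at the model, not the user
    return "model-side issue" if well_pronounced else "user-side issue"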
Quality & price comparison
Quality on a 0–10 scale derived from published benchmarks; AWS and GCP have no equivalent product, so they score 0. Price is per audio-minute in USD.
| Provider | Quality | Price/audio-min | Price vs market avg | Position |
|---|---|---|---|---|
| AWS (no offering) | 0.0/10 | $0 | 0% | — |
| GCP (no offering) | 0.0/10 | $0 | 0% | — |
| Azure Pronunciation Assessment | 8.5/10 | $1.32 | 153% | — |
| Speechace (B2B API) | 8.5/10 | $3.00 | 347% | — |
| ELSA Speak (consumer + API) | 8.0/10 | $0 | 0% | — |
| Brainiall LIGHT | 9.0/10 | $0.60 (31% cheaper) | 69% | Superior |
| Brainiall PREMIUM | 9.5/10 | $2.40 (178% above avg) | 278% | Superior |
Pricing rule: 90% off when inferior · 80% off at parity · 50% off when superior. Position determined by objective benchmark, refreshed quarterly.
Pricing
Light tier scores at production-grade quality with CPU inference; Premium tier uses fine-tuned WavLM-large for the highest published accuracy on our public test set. Discount rule: 50% off market average when our quality is superior — which it is, on both tiers.
Free
$0/mo
100 minutes/month · Light tier · Forever free
Starter
$19/mo
1,000 minutes/month · Light tier · Word + sentence scores
Pro
$99/mo
10,000 minutes/month · Light + Premium · Phone-level scores · 99.5% SLA
Business
$299/mo
50,000 minutes/month · Dedicated capacity · Direct line to founder
One-call integration
Send raw audio (base64) plus the expected text. Get back per-phone, per-word and per-sentence scores — one HTTP call, no streaming required.
curl https://api.brainiall.com/v1/pronunciation/assess \
-H "Authorization: Bearer $BRAINIALL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"audio": "<base64-wav>",
"text": "She sells seashells by the seashore",
"tier": "premium"
}'
# response (abbreviated):
# {
# "sentenceScore": 86,
# "wordScores": [{"word":"she","score":92}, {"word":"sells","score":78}, ...],
# "phoneScores": [{"phone":"SH","score":94}, ...],
# "confidence": 0.97
# }
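The same call from application code. A minimal Python sketch using requests, plus a toy traffic-light mapping for learner-facing phone scores; the helper names and thresholds are assumptions, not part of the API:

import base64, os, requests

API_URL = "https://api.brainiall.com/v1/pronunciation/assess"

def assess(wav_path, text, tier="light"):
    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['BRAINIALL_API_KEY']}"},
        json={"audio": audio_b64, "text": text, "tier": tier},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def color(score):
    # Toy traffic-light thresholds for a learner UI (assumptions, not API output)
    return "green" if score >= 85 else "yellow" if score >= 70 else "red"

result = assess("sample.wav", "She sells seashells by the seashore")
for p in result["phoneScores"]:
    print(p["phone"], p["score"], color(p["score"]))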
Methodology
The Light tier is a 9-model ensemble (4× MLP + 4× XGBoost + 1× PhoneTransformer) over features extracted from a Brainiall Speech (Edge tier) acoustic model, the same backbone we use for our STT product. The Premium tier adds LoRA fine-tuning on WavLM-large, a learnable weighted-sum across the 24 transformer layers, and ConPCO ordinal-contrastive loss. All numbers come from the public benchmark methodology page; the test set is our public test set, the reference benchmark for L2-English pronunciation.
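The learnable weighted-sum over WavLM-large's 24 layers is the standard frozen-encoder mixing trick: softmax a vector of per-layer scalars and blend the hidden states. A PyTorch sketch of the general technique (not our exact training code):

import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Softmax-weighted sum over the hidden states of a frozen encoder."""
    def __init__(self, num_layers: int = 24):
        super().__init__()
        # One learned scalar per transformer layer
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0)
        return torch.einsum("l,lbtd->btd", w, hidden_states)

# Toy shapes: 24 layers, batch 2, 50 frames, 1024-dim (WavLM-large width)
mix = WeightedLayerSum(24)
states = torch.randn(24, 2, 50, 1024)
print(mix(states).shape)  # torch.Size([2, 50, 1024])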