
Pronunciation Assessment API
Phone-level scores that exceed human agreement

The only commercial pronunciation API where the model is more consistent than trained human raters. Phone PCC 0.590 on our production Light tier and 0.682 on the Premium fine-tuned tier — both above the 0.555 inter-annotator ceiling on our public test set. AWS and GCP have no equivalent product; Azure offers 33 locales without publishing accuracy numbers.

Why this scoring matters

Pearson Correlation Coefficient (PCC) measures how closely automatic scores track the consensus of trained human raters. On our public test set benchmark, two trained humans agree at phone-level PCC 0.555. Our production model scores 0.590; our Premium tier scores 0.682. Both exceed the ceiling that trained humans achieve with each other, meaning the API is, on average, more consistent than two graders comparing notes.
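As a concrete sketch of the metric (using hypothetical scores, not benchmark data), PCC is the standard Pearson correlation between automatic phone scores and the human-rater consensus:

```python
def pcc(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical phone-level scores for one utterance:
model_scores = [80, 65, 90, 70, 85]   # automatic scores
human_scores = [78, 60, 92, 74, 80]   # trained-rater consensus
print(round(pcc(model_scores, human_scores), 3))  # → 0.936
```

A PCC of 1.0 would mean perfect agreement; the 0.555 inter-annotator figure is what two careful humans actually reach on phone-level judgments.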

Phone PCC (Light)

0.590

+3.5pp above human (0.555). 9-model ensemble, runs on CPU.

Phone PCC (Premium)

0.682

+12.7pp above human, +2.5pp above SOTA (HIA 0.657). LoRA fine-tune on WavLM-large.

Sentence PCC

0.711

+3.6pp above human (0.675). Sentence-level via XGBoost stacked over phone+word features.

Latency p95

423ms

Light tier on shared GPU; Premium ~530ms warm. Both sync — no batch wait.

Use cases

Edtech & language learning

L2 English speakers practicing on mobile or web. Get phone-level color-coded scores so the learner sees exactly which sounds need work — not just a sentence-level grade.

Call-center accent training

Score agent recordings nightly to track pronunciation drift across cohorts. The phone-level granularity surfaces specific phonemes (/θ/, /ð/, vowel reductions) that managerial review can't catch in real time.

Accessibility & speech therapy

Speech-language pathologists need objective per-phoneme scores for clinical baselines. Our PCC exceeds inter-annotator agreement among trained human raters — the score is more consistent than humans.

Voice-acting / dubbing QA

Validate that dubbed lines meet target pronunciation before mastering. Catches mid-session drift in long dubbing sessions before re-records become expensive.

Voice-search / IVR fine-tuning

When an STT model misrecognizes accented speech, you usually don't know if it's the model or the speaker. Pronunciation scoring on the audio answers that — separating user-side issues from model-side issues.

Quality & price comparison

Quality on a 0–10 scale derived from published benchmarks; AWS and GCP have no equivalent product so they score 0. Price is per audio-minute in USD.

Provider                          Quality   Price/audio-min   vs market avg            Position
AWS (no offering)                 0.0/10    $0                0%
GCP (no offering)                 0.0/10    $0                0%
Azure Pronunciation Assessment    8.5/10    $1.32             153%
Speechace (B2B API)               8.5/10    $3.00             347%
ELSA Speak (consumer + API)       8.0/10    $0                0%
Brainiall LIGHT                   9.0/10    $0.60             69% (31% cheaper)        Superior
Brainiall PREMIUM                 9.5/10    $2.40             278% (178% above avg)    Superior

Pricing rule: 90% off when inferior · 80% off at parity · 50% off when superior. Position determined by objective benchmark, refreshed quarterly.

Pricing

Light tier scores at production-grade quality with CPU inference; Premium tier uses fine-tuned WavLM-large for the highest published accuracy on our public test set. Discount rule: 50% off market average when our quality is superior — which it is, on both tiers.

Free

$0/mo

100 minutes/month · Light tier · Forever free

Starter

$19/mo

1,000 minutes/month · Light tier · Word + sentence scores

Pro

$99/mo

10,000 minutes/month · Light + Premium · Phone-level scores · 99.5% SLA

Business

$299/mo

50,000 minutes/month · Dedicated capacity · Direct line to founder

One-call integration

Send raw audio (base64) plus the expected text. Get back per-phone, per-word and per-sentence scores — one HTTP call, no streaming required.

curl https://api.brainiall.com/v1/pronunciation/assess \
 -H "Authorization: Bearer $BRAINIALL_API_KEY" \
 -H "Content-Type: application/json" \
 -d '{
 "audio": "<base64-wav>",
 "text": "She sells seashells by the seashore",
 "tier": "premium"
 }'

# response (abbreviated):
# {
# "sentenceScore": 86,
# "wordScores": [{"word":"she","score":92}, {"word":"sells","score":78}, ...],
# "phoneScores": [{"phone":"SH","score":94}, ...],
# "confidence": 0.97
# }
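The per-word scores in the response make it straightforward to surface problem words directly. A minimal sketch, assuming the field names shown in the abbreviated response above (the helper name and threshold are ours):

```python
def words_needing_work(assessment, threshold=80):
    """Return words scoring below `threshold` from an /assess response body."""
    return [w["word"] for w in assessment["wordScores"] if w["score"] < threshold]

# Parsed JSON body shaped like the abbreviated response above:
sample = {
    "sentenceScore": 86,
    "wordScores": [{"word": "she", "score": 92}, {"word": "sells", "score": 78}],
}
print(words_needing_work(sample))  # → ['sells']
```

The same pattern applies to phoneScores for phone-level drill-downs.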

Methodology

The Light tier is a 9-model ensemble (4× MLP + 4× XGBoost + 1× PhoneTransformer) over features extracted from a Brainiall Speech (Edge tier) acoustic model — same backbone we use for our STT product. The Premium tier adds LoRA fine-tuning on WavLM-large, learnable weighted-sum across the 24 transformer layers, plus ConPCO ordinal-contrastive loss. All numbers come from the public benchmark methodology page; the test set is our public test set (the SOTA reference for L2-English pronunciation).
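The learnable weighted-sum across transformer layers can be sketched as follows. This is a toy, framework-free illustration of the idea, not our training code; in practice the per-layer logits are parameters learned jointly with the scoring head:

```python
import math

def weighted_layer_sum(layer_feats, layer_logits):
    """Softmax-normalize per-layer logits, then mix the layer features."""
    exps = [math.exp(l) for l in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, layer_feats))
            for d in range(dim)]

# Two layers, 2-dim features; equal logits reduce to a plain average:
print(weighted_layer_sum([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))  # → [2.0, 3.0]
```

Rather than using only the top layer, this lets the model learn which of the 24 WavLM-large layers carry the most pronunciation-relevant signal.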

Pronunciation, scored more consistently than humans

Get free API key See competitive comparison
