HOW SCORING WORKS
How we score your pronunciation
This page explains step by step how Phraze measures pronunciation — the technology behind it, which numbers we show and why, and what happens to your data.
The 4-second path
Between your recording and the score on your screen, four steps happen in under four seconds. The two key steps run on two separate systems with separate jobs: one measures, one reacts.
Recording
Your device captures WAV PCM 16kHz mono — the native format Azure reads directly — and sends the file over an encrypted connection.
Azure analyzes
Microsoft Azure Pronunciation Assessment compares your recording phoneme by phoneme against the reference text and returns objective scores for accuracy, fluency, and — for Mandarin — tones.
Claude reacts
Claude Haiku 4.5 receives only the numbers and syllable breakdown — not the audio file — and writes a 1–2 sentence reaction in your chosen tutor personality.
You see the result
The score ring shows your AccuracyScore, the tutor bubble shows the reaction, and the weakest syllable is highlighted.
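The four steps above can be sketched as a single request handler. This is an illustration of the flow, not Phraze's actual code — every function name here (`assess_with_azure`, `react_with_claude`) is hypothetical, and the stubs stand in for the real services.

```python
def assess_with_azure(wav_bytes: bytes, reference_text: str) -> dict:
    # Stub standing in for Azure Pronunciation Assessment (step 2).
    return {"scores": {"accuracy": 75},
            "syllables": [{"syllable": "le", "accuracy": 100},
                          {"syllable": "cker", "accuracy": 29}]}

def react_with_claude(scores: dict, syllables: list, personality: str) -> str:
    # Stub standing in for Claude Haiku 4.5 (step 3) -- it sees numbers, never audio.
    weakest = min(syllables, key=lambda s: s["accuracy"])
    return f"{scores['accuracy']}% -- work on '{weakest['syllable']}'."

def score_recording(wav_bytes: bytes, reference_text: str, personality: str) -> dict:
    # Step 1: the recording arrives as WAV PCM 16kHz mono.
    assessment = assess_with_azure(wav_bytes, reference_text)
    del wav_bytes  # audio is discarded immediately after analysis

    reaction = react_with_claude(assessment["scores"],
                                 assessment["syllables"], personality)

    # Step 4: the client renders score ring, reaction bubble, weakest syllable.
    return {"scores": assessment["scores"],
            "reaction": reaction,
            "weakest": min(assessment["syllables"], key=lambda s: s["accuracy"])}
```

The point of the split is visible in the code: the measuring layer and the reacting layer never see each other's inputs.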
What we record — and what we don't
Your recording is captured as WAV PCM 16kHz mono — the only format Azure Pronunciation Assessment reliably processes. The audio file is discarded immediately after analysis: Phraze does not store it, Azure does not store it, it never touches a hard drive. What remains is only the numerical assessment data, associated with your account.
- Format: WAV PCM 16kHz mono (16-bit, uncompressed)
- Storage location: briefly in server RAM during analysis
- Saved to disk? No — neither at Phraze nor at Azure
- Upload size: approximately 32 KB per second of audio
- What we keep: only the numerical scores and syllable breakdown
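The ~32 KB/s figure follows directly from the format itself: 16,000 samples per second, 2 bytes per 16-bit sample, one channel.

```python
sample_rate_hz = 16_000   # 16 kHz
bytes_per_sample = 2      # 16-bit PCM
channels = 1              # mono

bytes_per_second = sample_rate_hz * bytes_per_sample * channels
print(bytes_per_second)      # 32000 bytes, i.e. ~32 KB per second of audio
print(bytes_per_second * 3)  # a 3-second phrase uploads ~96 KB
```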
The truth layer: Azure Pronunciation Assessment
Azure Pronunciation Assessment is Microsoft's Speech AI for phonetic analysis — the same system used by language schools, universities, and learning applications worldwide. We use it because it is deterministic: the same recording always produces the same score, regardless of model mood or time of day.
What Azure measures:
- AccuracyScore: How closely do your phonemes match the reference text? Calculated at the syllable and word level. This is the score we show.
- FluencyScore: Rhythm, pauses, speaking speed — how natural does the flow sound?
- ProsodyScore: Intonation, stress, melody. Only available for certain languages.
- CompletenessScore: Did you say the whole phrase, or did words go missing?
Why we show the AccuracyScore
Azure also produces a composite PronScore that combines AccuracyScore, FluencyScore, and CompletenessScore. The problem: on short phrases, Fluency and Completeness are nearly always 100 — if you say a single sentence, you don't pause in the middle of it or drop words. That inflates PronScore artificially, even when the phonemes were wrong. A concrete example: someone who says "leggar" instead of "Lecker" receives a PronScore of around 85%, because Fluency and Completeness are perfect. The AccuracyScore reads 75% — and the second syllable "cker" scores 29%. That is the honest number. We show it because it actually makes you better.
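The inflation effect is easy to see numerically. For illustration only, assume the composite is a plain average of the three sub-scores — Azure's actual PronScore weighting differs, but the mechanism is the same:

```python
# The "leggar" attempt: wrong phonemes, but a single fluent, complete sentence.
accuracy, fluency, completeness = 75, 100, 100

# Illustrative average, NOT Azure's actual PronScore formula.
composite = (accuracy + fluency + completeness) / 3
print(round(composite, 1))  # 91.7 -- flattered by the two automatic 100s
print(accuracy)             # 75 -- the honest number Phraze shows
```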
“We show you the score that's honest — not the one that feels good.”
Down to the syllable
Azure calculates AccuracyScores not just for the whole word, but for each individual syllable. Phraze automatically identifies the weakest syllable and passes that information to the tutor, so the feedback focuses on what actually went wrong. Using "Lecker" as an example: the syllable "le" scores 100%, the syllable "cker" scores only 29% — the tutor addresses exactly that gap.
Example: "Lecker" — "le" 100%, "cker" 29%
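Picking the weakest syllable out of the per-syllable breakdown is a one-liner. The response shape below is simplified for illustration — the field names are not Azure's exact JSON keys:

```python
# Simplified per-syllable result for "Lecker" (field names illustrative).
syllables = [
    {"syllable": "le", "accuracy": 100},
    {"syllable": "cker", "accuracy": 29},
]

# The syllable with the lowest accuracy is what the tutor gets told about.
weakest = min(syllables, key=lambda s: s["accuracy"])
print(weakest["syllable"], weakest["accuracy"])  # cker 29
```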
Mandarin: every tone counts
In Mandarin, the tone of a syllable completely changes its meaning — mā (妈, mother), má (麻, hemp), mǎ (马, horse), and mà (骂, to scold) are four different words. Azure returns not just an AccuracyScore for each syllable, but also encodes the expected and the actually-heard tone as phoneme labels in SAPI format — for example "hao 4" for the fourth tone on "hao" or "chi 1" for the first tone on "chi". Phraze extracts this tone information separately and surfaces it as a dedicated tone-error list, so you can see exactly which syllable needed which tone.
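Because the tone is encoded as a trailing digit in the SAPI-style phoneme label, extracting a tone-error list amounts to comparing those digits. A minimal sketch, with a deliberately simplified input shape (real Azure responses nest this much deeper):

```python
# Build a tone-error list from SAPI-style labels such as "hao 4".
# Input shape is simplified for illustration.

def tone_errors(syllables: list) -> list:
    errors = []
    for s in syllables:
        expected_tone = s["expected"].split()[-1]  # "hao 4" -> "4"
        heard_tone = s["heard"].split()[-1]
        if expected_tone != heard_tone:
            errors.append({"syllable": s["expected"].split()[0],
                           "expected": expected_tone,
                           "heard": heard_tone})
    return errors

result = tone_errors([
    {"expected": "chi 1", "heard": "chi 1"},  # first tone, correct
    {"expected": "hao 4", "heard": "hao 3"},  # fourth tone heard as third
])
print(result)  # [{'syllable': 'hao', 'expected': '4', 'heard': '3'}]
```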
The personality layer: Claude Haiku 4.5
Azure delivers numbers — precise, objective, and without personality. Claude Haiku 4.5 translates those numbers into a 1–2 sentence reaction that matches your chosen tutor personality and is written in your app language.
Why two models?
Azure can't write roasts — it returns scores. Claude can't measure phonemes — it interprets text. The combination produces something neither could do alone: an objective, traceable assessment delivered in a voice that feels like a character rather than a form field.
Three tutors, one truth
All three tutor modes react to the same score — 75% on "Lecker", weakest syllable "cker" at 29%. What changes is the personality, not the facts.
“Almost there! The 'cker' syllable still has room to improve — try making the 'ck' shorter and crisper. You're getting closer.”
“75% — not bad. But 'cker' at 29%? Sounds like you just daydreamed through the second syllable. Again, and this time stay awake.”
“First syllable was solid. The second was a crime against the German language. 'cker'. Say it. Crisp. Now.”
Claude only ever gets numbers and syllables — never the audio file itself.
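That privacy boundary is structural, not a policy promise: the request to Claude is assembled from the assessment numbers alone, so there is nothing audio-shaped that could leak. A hypothetical sketch of such a prompt builder — the function and its parameters are illustrative, not Phraze's actual code:

```python
def build_tutor_prompt(personality: str, word: str, accuracy: int,
                       weakest_syllable: str, weakest_score: int) -> str:
    # Only numbers and syllable names go in -- the audio never reaches this layer.
    return (f"You are a {personality} pronunciation tutor. "
            f"The learner said '{word}' with accuracy {accuracy}%. "
            f"Weakest syllable: '{weakest_syllable}' at {weakest_score}%. "
            f"React in 1-2 sentences.")

prompt = build_tutor_prompt("strict", "Lecker", 75, "cker", 29)
```

Whatever personality is selected, the same five values go in — only the tone of the reply changes.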
What happens to your data
- ✓ Your audio recording is uploaded to Phraze, forwarded to Azure, then deleted. We do not store audio files.
- ✓ Azure does not train its models on your audio — this is guaranteed contractually by the Microsoft Azure Speech API terms.
- ✓ Claude Haiku 4.5 only ever sees the numerical scores and syllable breakdown — never your voice.
- ✓ Your scores are stored in your account so you can track your progress over time. You can delete them at any time via Account deletion in settings.
- ✓ Phraze runs its servers in the EU (NeonDB Frankfurt). Azure region: West Europe.
Frequently asked questions