HOW SCORING WORKS
How we score your pronunciation
This page explains step by step how Phraze measures pronunciation — the technology behind it, which numbers we show and why, and what happens to your data.
The 4-second path
Between your recording and the score on your screen, four steps happen in under four seconds. The two key steps run on two separate systems with separate jobs: one measures, one reacts.
Recording
Your device captures WAV PCM 16kHz mono — the native format Azure reads directly — and sends the file over an encrypted connection.
Azure analyzes
Microsoft Azure Pronunciation Assessment compares your recording phoneme by phoneme against the reference text and returns objective scores for accuracy, fluency, and — for Mandarin — tones.
Claude reacts
Claude Haiku 4.5 receives only the numbers and syllable breakdown — not the audio file — and writes a 1–2 sentence reaction in your chosen tutor personality.
You see the result
The score ring shows your AccuracyScore, the tutor bubble shows the reaction, and the weakest syllable is highlighted.
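The four steps above can be sketched as a single request handler. This is an illustration of the flow, not Phraze's actual code — every function name here (`assess_with_azure`, `react_with_claude`) is hypothetical, and the stubs stand in for the real services.

```python
def assess_with_azure(wav_bytes: bytes, reference_text: str) -> dict:
    # Stub standing in for Azure Pronunciation Assessment (step 2).
    return {"scores": {"accuracy": 75},
            "syllables": [{"syllable": "le", "accuracy": 100},
                          {"syllable": "cker", "accuracy": 29}]}

def react_with_claude(scores: dict, syllables: list, personality: str) -> str:
    # Stub standing in for Claude Haiku 4.5 (step 3) -- it sees numbers, never audio.
    weakest = min(syllables, key=lambda s: s["accuracy"])
    return f"{scores['accuracy']}% -- work on '{weakest['syllable']}'."

def score_recording(wav_bytes: bytes, reference_text: str, personality: str) -> dict:
    # Step 1: the recording arrives as WAV PCM 16kHz mono.
    assessment = assess_with_azure(wav_bytes, reference_text)
    del wav_bytes  # audio is discarded immediately after analysis

    reaction = react_with_claude(assessment["scores"],
                                 assessment["syllables"], personality)

    # Step 4: the client renders score ring, reaction bubble, weakest syllable.
    return {"scores": assessment["scores"],
            "reaction": reaction,
            "weakest": min(assessment["syllables"], key=lambda s: s["accuracy"])}
```

The point of the split is visible in the code: the measuring layer and the reacting layer never see each other's inputs.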
What we record — and what we don't
Your recording is captured as WAV PCM 16kHz mono — the only format Azure Pronunciation Assessment reliably processes. The audio file is discarded immediately after analysis: Phraze does not store it, Azure does not store it, it never touches a hard drive. What remains is only the numerical assessment data, associated with your account.
- Format: WAV PCM 16kHz mono (16-bit, uncompressed)
- Storage location: briefly in server RAM during analysis
- Saved to disk? No — neither at Phraze nor at Azure
- Upload size: approximately 32 KB per second of audio
- What we keep: only the numerical scores and syllable breakdown
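The ~32 KB/s figure follows directly from the format itself: 16,000 samples per second, 2 bytes per 16-bit sample, one channel.

```python
sample_rate_hz = 16_000   # 16 kHz
bytes_per_sample = 2      # 16-bit PCM
channels = 1              # mono

bytes_per_second = sample_rate_hz * bytes_per_sample * channels
print(bytes_per_second)      # 32000 bytes, i.e. ~32 KB per second of audio
print(bytes_per_second * 3)  # a 3-second phrase uploads ~96 KB
```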
The truth layer: Azure Pronunciation Assessment
Azure Pronunciation Assessment is Microsoft's Speech AI for phonetic analysis — the same system used by language schools, universities, and learning applications worldwide. We use it because it is deterministic: the same recording always produces the same score, regardless of model mood or time of day.
What Azure measures:
- AccuracyScore: How closely do your phonemes match the reference text? Calculated at the syllable and word level. This is the score we show.
- FluencyScore: Rhythm, pauses, speaking speed — how natural does the flow sound?
- ProsodyScore: Intonation, stress, melody. Only available for certain languages.
- CompletenessScore: Did you say the whole phrase, or did words go missing?
Why we show the AccuracyScore
Azure also produces a composite PronScore that combines AccuracyScore, FluencyScore, and CompletenessScore. The problem: on short phrases, Fluency and Completeness are nearly always 100 — if you say a single sentence, you don't pause in the middle of it or drop words. That inflates PronScore artificially, even when the phonemes were wrong. A concrete example: someone who says "leggar" instead of "Lecker" receives a PronScore of around 85%, because Fluency and Completeness are perfect. The AccuracyScore reads 75% — and the second syllable "cker" scores 29%. That is the honest number. We show it because it actually makes you better.
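The inflation effect is easy to see numerically. For illustration only, assume the composite is a plain average of the three sub-scores — Azure's actual PronScore weighting differs, but the mechanism is the same:

```python
# The "leggar" attempt: wrong phonemes, but a single fluent, complete sentence.
accuracy, fluency, completeness = 75, 100, 100

# Illustrative average, NOT Azure's actual PronScore formula.
composite = (accuracy + fluency + completeness) / 3
print(round(composite, 1))  # 91.7 -- flattered by the two automatic 100s
print(accuracy)             # 75 -- the honest number Phraze shows
```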
“We show you the score that's honest — not the one that feels good.”
Down to the syllable
Azure calculates AccuracyScores not just for the whole word, but for each individual syllable. Phraze automatically identifies the weakest syllable and passes that information to the tutor, so the feedback focuses on what actually went wrong. Using "Lecker" as an example: the syllable "le" scores 100%, the syllable "cker" scores only 29% — the tutor addresses exactly that gap.
Example: "Lecker" — "le" 100%, "cker" 29%
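Picking the weakest syllable out of the per-syllable breakdown is a one-liner. The response shape below is simplified for illustration — the field names are not Azure's exact JSON keys:

```python
# Simplified per-syllable result for "Lecker" (field names illustrative).
syllables = [
    {"syllable": "le", "accuracy": 100},
    {"syllable": "cker", "accuracy": 29},
]

# The syllable with the lowest accuracy is what the tutor gets told about.
weakest = min(syllables, key=lambda s: s["accuracy"])
print(weakest["syllable"], weakest["accuracy"])  # cker 29
```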
Mandarin: every tone counts
In Mandarin, the tone of a syllable completely changes its meaning — mā (妈, mother), má (麻, hemp), mǎ (马, horse), and mà (骂, to scold) are four different words. Azure returns not just an AccuracyScore for each syllable, but also encodes the expected and the actually-heard tone as phoneme labels in SAPI format — for example "hao 4" for the fourth tone on "hao" or "chi 1" for the first tone on "chi". Phraze extracts this tone information separately and surfaces it as a dedicated tone-error list, so you can see exactly which syllable needed which tone.
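Because the tone is encoded as a trailing digit in the SAPI-style phoneme label, extracting a tone-error list amounts to comparing those digits. A minimal sketch, with a deliberately simplified input shape (real Azure responses nest this much deeper):

```python
# Build a tone-error list from SAPI-style labels such as "hao 4".
# Input shape is simplified for illustration.

def tone_errors(syllables: list) -> list:
    errors = []
    for s in syllables:
        expected_tone = s["expected"].split()[-1]  # "hao 4" -> "4"
        heard_tone = s["heard"].split()[-1]
        if expected_tone != heard_tone:
            errors.append({"syllable": s["expected"].split()[0],
                           "expected": expected_tone,
                           "heard": heard_tone})
    return errors

result = tone_errors([
    {"expected": "chi 1", "heard": "chi 1"},  # first tone, correct
    {"expected": "hao 4", "heard": "hao 3"},  # fourth tone heard as third
])
print(result)  # [{'syllable': 'hao', 'expected': '4', 'heard': '3'}]
```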
The personality layer: Claude Haiku 4.5
Azure delivers numbers — precise, objective, and without personality. Claude Haiku 4.5 translates those numbers into a 1–2 sentence reaction that matches your chosen tutor personality and is written in your app language.
Why two models?
Azure can't write roasts — it returns scores. Claude can't measure phonemes — it interprets text. The combination produces something neither could do alone: an objective, traceable assessment delivered in a voice that feels like a character rather than a form field.
Three tutors, one truth
All three tutor modes react to the same score — 75% on "Lecker", weakest syllable "cker" at 29%. What changes is the personality, not the facts.
“Almost there! The 'cker' syllable still has room to improve — try making the 'ck' shorter and crisper. You're getting closer.”
“75% — not bad. But 'cker' at 29%? Sounds like you just daydreamed through the second syllable. Again, and this time stay awake.”
“First syllable was solid. The second was a crime against the German language. 'cker'. Say it. Crisp. Now.”
Claude only ever gets numbers and syllables — never the audio file itself.
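That privacy boundary is structural, not a policy promise: the request to Claude is assembled from the assessment numbers alone, so there is nothing audio-shaped that could leak. A hypothetical sketch of such a prompt builder — the function and its parameters are illustrative, not Phraze's actual code:

```python
def build_tutor_prompt(personality: str, word: str, accuracy: int,
                       weakest_syllable: str, weakest_score: int) -> str:
    # Only numbers and syllable names go in -- the audio never reaches this layer.
    return (f"You are a {personality} pronunciation tutor. "
            f"The learner said '{word}' with accuracy {accuracy}%. "
            f"Weakest syllable: '{weakest_syllable}' at {weakest_score}%. "
            f"React in 1-2 sentences.")

prompt = build_tutor_prompt("strict", "Lecker", 75, "cker", 29)
```

Whatever personality is selected, the same five values go in — only the tone of the reply changes.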
What happens to your data
- ✓ Your audio recording is uploaded to Phraze, forwarded to Azure, then deleted. We do not store audio files.
- ✓ Azure does not train its models on your audio — this is guaranteed contractually by the Microsoft Azure Speech API terms.
- ✓ Claude Haiku 4.5 only ever sees the numerical scores and syllable breakdown — never your voice.
- ✓ Your scores are stored in your account so you can track your progress over time. You can delete them at any time via Account deletion in settings.
- ✓ Phraze runs its servers in the EU (NeonDB Frankfurt). Azure region: West Europe.
Frequently asked questions