Stateazy · 23 min read · 12 April 2026

What Are You 95% Confident About, Exactly?

Confidence intervals — what they actually mean (not what you think), why they're better than p-values, and how to read them clinically.

Stateazy Series

Why Is It Called "Confidence" When Nobody Feels Confident Reading It?

The Problem First

You're a medicine resident. You read an abstract:

"Drug X reduced systolic BP by 12 mmHg (95% CI: 3 to 21, p=0.01)."

Your senior asks: "What does the confidence interval tell you?"

You say: "There's a 95% chance the true effect is between 3 and 21."

You're wrong. And so is almost every doctor who's ever said that sentence. That interpretation — the most intuitive one, the one that feels natural — is technically incorrect. And the reason it's wrong is buried in the naming history of who invented this thing and what they actually meant by "confidence."

The correct interpretation is weirder, less intuitive, and more important than you think. Let's get there.


Before the Term — What Problem Are We Solving?

You run a study. You find that Drug X lowers BP by 12 mmHg in your sample of 80 patients.

But you know this 12 is a sample estimate. A different sample of 80 patients might give you 9. Or 15. Or 7. The true population effect is unknown and unknowable (you'd need to treat every human on earth).

So instead of reporting a single number that pretends to be exact, you want to report a range that accounts for the uncertainty. A range that says: "Given the data we have, the true effect is plausible somewhere in here."

That range is the confidence interval. But WHY is it called that?


Word Surgery: "Confidence Interval"

"Confidence"

Root: Latin con- (with, together) + fidere (to trust, to have faith) → confidentia = "with trust" / "full trust"

Literal meaning: "the degree to which the METHOD is trustworthy"

Critical distinction: The "confidence" is NOT in any single interval. It's in the PROCEDURE that generates intervals. More on this in a moment — this is the single most misunderstood word in all of statistics.

"Interval"

Root: Latin inter- (between) + vallum (wall, rampart) → intervallum = "the space between two walls"

Originally a military term — the gap between fortification walls. In statistics: the gap between two numbers (the lower and upper bounds).

So "confidence interval" literally = "a space between two walls that you can trust."

Aha: Think of it as two goalposts. The CI tells you: "the true value is almost certainly somewhere between these posts." The "confidence" is in how reliably the procedure places the posts.

Naming Family

| Term | What It Is | How It Differs |
| --- | --- | --- |
| Confidence Interval (CI) | Frequentist range for an unknown parameter | Based on repeated-sampling logic |
| Credible Interval | Bayesian cousin — a range with a direct probability interpretation | "95% probability the parameter is in this range" (what people THINK CI means) |
| Prediction Interval | Range for a FUTURE individual observation | Wider than a CI because it includes individual variability + estimation uncertainty |
| Tolerance Interval | Range expected to contain a specified proportion of the population | Used in manufacturing/quality control |
| Reference Range | Mean ± 2 SD in a healthy population | Describes individual variability, not estimation uncertainty |

The confusion: People want the credible interval interpretation (direct probability). They're given the confidence interval interpretation (procedure reliability). The names sound similar. The meanings are fundamentally different. This naming collision causes more misunderstanding than any other concept in medical statistics.


Who Invented This? — The Neyman Story

Jerzy Neyman (1934) — The Inventor

Jerzy Neyman, a Polish-born mathematician working in London, introduced the concept of confidence intervals in a 1934 paper presented to the Royal Statistical Society.

Neyman was solving a very specific problem: how to make interval estimates that have guaranteed long-run reliability.

His framework was deliberately frequentist — he wanted a procedure with a known error rate, not a statement about any individual interval.

What Neyman Actually Meant by "Confidence"

Neyman's original framing (paraphrased):

"If we use this procedure to construct intervals from repeated samples, 95% of the intervals we construct will contain the true parameter."

The confidence is in the PROCEDURE, not in any single interval.

Analogy: A factory makes parachutes. "95% confidence" means 95% of their parachutes open successfully. Once you're falling with a specific parachute, it either opens or it doesn't — you can't say there's a "95% chance" THIS parachute works. Likewise, a specific interval either contains the true value or it doesn't. You just don't know which.

This is why "there's a 95% chance the true value is in this interval" is technically wrong. The true value is either in the interval or it isn't. The 95% refers to the procedure's long-run success rate, not to this specific interval's probability of being correct.
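You can watch Neyman's guarantee happen. Here's a minimal simulation sketch in Python (assuming numpy is available; the true mean of 12 mmHg, SD of 15, and n of 80 are made up for illustration): draw thousands of samples, build a 95% CI from each, and count how often the interval captures the truth.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean, sd, n, trials = 12.0, 15.0, 80, 10_000  # hypothetical numbers

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sd, size=n)
    sem = sample.std(ddof=1) / np.sqrt(n)              # standard error of the mean
    lo, hi = sample.mean() - 1.96 * sem, sample.mean() + 1.96 * sem
    covered += lo <= true_mean <= hi                   # did THIS interval capture the truth?

# Close to 95%: the procedure's long-run success rate
print(f"Coverage: {covered / trials:.1%}")
```

Each individual interval either captured the true mean or it didn't. Only the collection earns the 95%.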

Why "Confidence" and Not "Probability"?

Neyman deliberately avoided the word "probability" for the interval because:

  1. In the frequentist framework, the parameter is a fixed (unknown) number, not a random variable. You can't assign a probability to a fixed number being in a range.
  2. The word "probability" was already used by Bayesians (who DO assign probability to parameters). Neyman wanted to distinguish his approach.
  3. He chose "confidence" to convey "trustworthiness of the method" without implying "probability of the specific result."

The naming was intentional disambiguation. Neyman picked "confidence" SPECIFICALLY BECAUSE it wasn't "probability." Then generations of students interpreted "confidence" as "probability" anyway — defeating his entire purpose.

Ronald Fisher's Rival: Fiducial Intervals

Ronald Fisher, Neyman's bitter rival, had his own competing concept: fiducial intervals (from Latin fiducia = trust, faith — same root as "confidence").

Fisher's fiducial approach DID allow probability statements about parameters, but his mathematical framework was inconsistent and was eventually abandoned by most statisticians. Neyman's confidence interval won.

The irony: Fisher's fiducial interval gave the interpretation everyone wants (probability of the parameter). Neyman's confidence interval gives the interpretation that's mathematically rigorous but counterintuitive. The rigorous one won. The intuitive interpretation persists as a zombie — everyone says it, most textbooks tacitly allow it, and strict frequentists cringe every time.


The Dictionary Problem — Why the Name Misleads

Source"Confidence" means...
Oxford English Dictionary"The feeling or belief that one can have faith in or rely on someone or something"
Everyday speech"I'm confident" = "I believe this is true"
Neyman's statistics"The procedure is reliable" = "If repeated infinitely, 95% of intervals would contain the truth"

The collision: Everyday "confidence" is about belief in a specific claim. Statistical "confidence" is about the reliability of a repeatable procedure. When a paper says "95% CI: 3 to 21," your brain reads it as "I'm 95% confident the answer is between 3 and 21." Neyman meant: "This method produces correct intervals 95% of the time."

The practical difference? Honestly, for clinical decision-making, it's small. The Bayesian credible interval and the frequentist confidence interval often give similar numbers. The philosophical distinction matters for statisticians. What matters for YOU as a clinician is understanding what the interval TELLS you, regardless of the philosophical debate.


What the Interval Actually Tells You — The Practical Version

Forget the philosophical debate. Here's what a CI gives you clinically:

The Width = Your Uncertainty

| CI | Width | What It Tells You |
| --- | --- | --- |
| 95% CI: 10 to 14 mmHg | Narrow (4 mmHg) | Precise estimate. The true effect is well-pinned. You can trust the point estimate. |
| 95% CI: 2 to 22 mmHg | Wide (20 mmHg) | Imprecise. The drug might barely work (2 mmHg) or work brilliantly (22 mmHg). You don't really know. |
| 95% CI: -3 to 21 mmHg | Crosses zero | The true effect might be NEGATIVE (drug harmful) or positive. The study cannot rule out no effect. |

The Three Clinical Readings

Reading 1: Does the CI cross the null?

  • For differences: does it cross 0?
  • For ratios (OR, HR, RR): does it cross 1?
  • If yes → effect is not statistically significant at the chosen α level.

Reading 2: Where does the CI sit relative to clinical importance?

This is where CI beats the p-value. A p-value says "significant or not." A CI shows you the RANGE of plausible effects.

Example: Minimum clinically important difference for BP = 5 mmHg.

| Scenario | 95% CI | p-value | Clinical Interpretation |
| --- | --- | --- | --- |
| A | 8 to 16 mmHg | <0.001 | Entirely above 5 mmHg → clinically AND statistically significant |
| B | 1 to 7 mmHg | 0.02 | Partially below 5 mmHg → statistically significant but might not be clinically important |
| C | -1 to 3 mmHg | 0.30 | Crosses zero → neither statistically nor clinically significant |
| D | 6 to 22 mmHg | 0.03 | Above 5 mmHg but very wide → statistically significant, probably clinically important, but imprecise (underpowered?) |

The p-values for scenarios A and B are both "significant." The CI shows you that A is a clear win and B is marginal. This is why the CI is more informative than the p-value.

Reading 3: Is the CI narrow enough for a decision?

Even a non-significant result can be useful if the CI is narrow enough to rule out a clinically important effect. "95% CI: -1 to 2 mmHg" — the drug doesn't work, and we're fairly sure. That's a useful negative result.

"95% CI: -8 to 15 mmHg" — the drug might cause 8 mmHg harm or 15 mmHg benefit. This study told you nothing. The CI is too wide. Underpowered.


The Formula — What Builds a CI

The Basic Structure

CI = Point Estimate ± (Critical Value × Standard Error)

For a 95% CI of a mean:

CI = x̄ ± 1.96 × SEM
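To make the structure concrete, here's the formula as a short Python sketch (the trial numbers are hypothetical: mean reduction 12 mmHg, SD 15, n = 80):

```python
import math

def mean_ci_95(xbar, sd, n, z=1.96):
    """95% CI for a mean from summary statistics: x̄ ± z × SEM."""
    sem = sd / math.sqrt(n)            # SEM = SD / √n
    return xbar - z * sem, xbar + z * sem

# Hypothetical trial: mean BP reduction 12 mmHg, SD 15 mmHg, 80 patients
lo, hi = mean_ci_95(12, 15, 80)
print(f"95% CI: {lo:.1f} to {hi:.1f} mmHg")  # 8.7 to 15.3
```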

Word Surgery on Each Component

"Point Estimate"

Roots: Point (Latin punctum = a prick, a dot → a single precise value) + Estimate (Latin aestimare = to value, to appraise)

"A single-dot appraisal" — your best single guess of the truth. The sample mean, the sample proportion, the observed hazard ratio.

Why this name? Because on a number line, it's a single POINT. The CI expands that point into a range.

"Critical Value"

Roots: Critical (Greek kritikos = able to judge, from krinein = to separate, to decide) + Value (Latin valere = to be worth)

"The judging number" — the threshold that determines how wide to make the interval. For 95% CI, the critical value is 1.96 (from the standard normal distribution). For 99% CI, it's 2.576.

Why 1.96? Because in a standard Gaussian distribution, 95% of values fall within ±1.96 standard deviations of the mean. This number comes directly from the mathematics of the bell curve — Gauss's legacy embedded in every CI you'll ever read.

Naming family: Critical value → critical region → rejection region. All use "critical" in the sense of "decision-making boundary."

"Standard Error"

Roots: Standard (the agreed-upon reference measure) + Error (Latin errare = to wander, to stray)

"The standard amount by which your estimate wanders from the truth"

Not "error" as in mistake. Error in statistics means natural variability — the unavoidable wandering of estimates around the true value. The standard error is the standard (typical) amount of that wandering.

SEM = SD / √n. This is why sample size matters: larger n → smaller SEM → narrower CI → more precise estimate.

What Determines CI Width?

Three things:

| Factor | Effect on CI Width | Why |
| --- | --- | --- |
| Sample size (n) | n↑ → CI narrows | SEM = SD/√n. More patients → less wandering of the mean |
| Variability (SD) | SD↑ → CI widens | More spread in individual patients → more uncertainty about the mean |
| Confidence level | 99% > 95% > 90% | Higher confidence demands wider nets. The price of being more "sure" is less precision. |

The tradeoff: You can have high confidence OR narrow intervals, not both (at a fixed sample size). Want 99.9% confidence? Your interval will be so wide it's useless. Want a razor-thin interval? Drop to 80% confidence. The sweet spot — 95% — is a convention, not a commandment.
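You can put numbers on that tradeoff. A small sketch (assuming scipy; the SD and n are the same hypothetical values as above) pulls each critical value from the standard normal distribution and prints the resulting half-width:

```python
import math
from scipy import stats

sd, n = 15.0, 80                      # hypothetical SD and sample size
sem = sd / math.sqrt(n)

for level in (0.80, 0.90, 0.95, 0.99, 0.999):
    z = stats.norm.ppf(1 - (1 - level) / 2)   # two-sided critical value
    print(f"{level:.1%} CI: z = {z:.3f}, half-width = ±{z * sem:.2f} mmHg")
# z climbs 1.282 → 1.645 → 1.960 → 2.576 → 3.291: more "confidence" buys a wider interval.
```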

Why 95%? Who Chose That Number?

Ronald Fisher (1925) in Statistical Methods for Research Workers suggested 5% (α = 0.05) as a convenient threshold for significance, which corresponds to the 95% confidence level.

Fisher wrote: "It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."

"Convenient." Not "optimal." Not "mathematically derived." Not "the only valid choice." Convenient.

The entire edifice of medical research — billions of dollars in drug development, regulatory decisions affecting millions of lives — rests on one man's judgment call about what was "convenient" in 1925.

There's nothing sacred about 95%. FDA sometimes uses 90% CI (bioequivalence studies). Some fields use 99%. The choice of confidence level is a decision about how much uncertainty you're willing to tolerate, not a law of nature.


The Regulatory Dimension

FDA and Confidence Intervals — Non-Negotiable

1. The CI Is the Decision Tool, Not the p-Value

FDA reviewers are trained to evaluate CIs, not just p-values. Internal FDA statistical review reports always discuss CI width and clinical relevance.

ICH E9 Section 5.5: "Confidence intervals are informative because they indicate the range within which the true treatment difference is likely to fall, and they indicate the precision of the treatment difference estimate. For some study designs, particularly equivalence or non-inferiority trials, the confidence interval approach may be more relevant than the hypothesis testing approach."

Translation: FDA considers CIs more informative than p-values, especially for equivalence/non-inferiority trials.

2. Bioequivalence — Where the CI IS the Decision

For generic drug approval, the entire decision rests on a CI:

The 90% CI for the geometric mean ratio of AUC and Cmax must fall entirely within 80-125%.

Not the p-value. Not the point estimate. The CI. If the point estimate is 100% (perfect equivalence) but the 90% CI is 78-122%, the drug FAILS — because the lower bound (78%) breaches the 80% threshold.

This means a drug can have a perfect average result and still fail because the CI is too wide (too few subjects, too much variability). The CI captures uncertainty. The point estimate doesn't. FDA trusts the CI.
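The rule is mechanical enough to write as code. A toy sketch (the function and the second example are mine, not FDA software):

```python
def bioequivalent(ci_lo, ci_hi, limits=(80.0, 125.0)):
    """Average-bioequivalence rule: the ENTIRE 90% CI of the
    geometric mean ratio (in %) must lie within 80-125%."""
    return limits[0] <= ci_lo and ci_hi <= limits[1]

print(bioequivalent(78.0, 122.0))  # False: point estimate 100%, but the lower bound breaches 80%
print(bioequivalent(92.0, 108.0))  # True: the whole interval sits inside the goalposts
```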

3. Non-Inferiority Trials — The Margin Lives in the CI

Non-inferiority trials ask: "Is the new drug no worse than the old drug by more than Δ (the non-inferiority margin)?"

The decision rule: The lower bound of the 95% CI for the treatment difference must be above -Δ.

| Scenario | Point Estimate | 95% CI | NI Margin (Δ) | Decision |
| --- | --- | --- | --- | --- |
| A | +2% | -1% to +5% | -3% | Lower bound (-1%) > -3% → Non-inferior ✓ |
| B | +1% | -4% to +6% | -3% | Lower bound (-4%) < -3% → Failed ✗ |
| C | -1% | -2.5% to +0.5% | -3% | Lower bound (-2.5%) > -3% → Non-inferior ✓ |

Scenario C is fascinating: The new drug is NUMERICALLY worse (point estimate -1%), but the CI proves it's not worse by more than the margin. This is a legitimate non-inferiority claim. Without understanding CIs, this result looks like the new drug is worse. With CIs, you see it's acceptably similar.
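The whole decision compresses to one comparison. A sketch, assuming Δ = 3 percentage points as in the table (the helper function is hypothetical):

```python
def non_inferior(ci_lower_bound, delta=3.0):
    """Non-inferiority: lower bound of the 95% CI for the treatment
    difference must lie above -Δ."""
    return ci_lower_bound > -delta

print(non_inferior(-1.0))  # A: True  (non-inferior)
print(non_inferior(-4.0))  # B: False (failed)
print(non_inferior(-2.5))  # C: True  (non-inferior despite a negative point estimate)
```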

4. Subgroup Analyses — Forest Plots Are CI Plots

Every forest plot in a regulatory submission is a visual display of CIs across subgroups. FDA reviewers examine:

  • Do all subgroup CIs overlap with the overall CI? (Consistent effect)
  • Does any subgroup CI cross the null? (Potential inefficacy in that subgroup)
  • Are subgroup CIs very wide? (Underpowered subgroup, interpret with caution)

5. Safety Evaluation

For rare adverse events, the CI around the observed incidence rate tells FDA whether a safety signal is real or noise.

"Observed liver toxicity: 2.1% (95% CI: 0.5% to 5.8%)"

The true rate could be as low as 0.5% (acceptable) or as high as 5.8% (concerning). The point estimate (2.1%) looks fine. The upper bound of the CI (5.8%) keeps the safety flag raised. FDA uses the upper bound of the CI for safety, not the point estimate. Conservative approach — assume the worst end of the uncertainty range.
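For rates this rare, an exact (Clopper-Pearson) binomial interval is the standard tool. A sketch assuming scipy ≥ 1.7, with a hypothetical count of 3 events in 140 patients chosen to land near the figures above (not the actual data):

```python
from scipy import stats

k, n = 3, 140  # hypothetical: 3 liver-toxicity events in 140 patients ≈ 2.1%
ci = stats.binomtest(k, n).proportion_ci(confidence_level=0.95, method="exact")
print(f"Observed: {k/n:.1%}, 95% CI: {ci.low:.1%} to {ci.high:.1%}")
# The regulator's eye goes to ci.high, the worst end of the uncertainty range.
```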


Branch-by-Branch — Where CI Bites You

General Medicine

The scenario: A trial of a new antihypertensive reports: "Mean BP reduction: 8 mmHg (95% CI: 1 to 15, p=0.03)."

A resident sees p=0.03 → "drug works!" → prescribes.

What the CI tells you that the p-value doesn't:

  • The effect could be as small as 1 mmHg (clinically worthless for BP)
  • The effect could be as large as 15 mmHg (excellent)
  • The CI is wide (14 mmHg span) → this study was probably underpowered
  • You're "significantly" uncertain about how much the drug actually does

A resident who reads only p-values thinks this is a positive trial. A resident who reads the CI thinks this is an inconclusive trial that happened to cross the significance threshold. Both are reading the same paper. Only one is reading it correctly.


Surgery

The scenario: "Robotic surgery reduces operative time by 12 minutes (95% CI: -3 to 27 minutes, p=0.11)."

The paper concludes: "No significant difference between robotic and conventional surgery."

What the CI tells you: The true difference could be 27 minutes in favour of robotic (substantial) or 3 minutes in favour of conventional (trivial). The study CANNOT rule out a clinically meaningful benefit of robotic surgery. This isn't a "negative" trial — it's an underpowered trial. The CI is too wide to conclude anything.

A hospital that decides "robotic surgery has no benefit" based on this study is making a decision from an uninformative CI, not from evidence of no effect.

"Absence of significance is not significance of absence." The CI makes this visible. The p-value hides it.


Paediatrics

The scenario: A vaccine efficacy trial in children: "Efficacy: 72% (95% CI: 18% to 91%)."

Point estimate looks great (72%). But the CI stretches from 18% (barely protective) to 91% (nearly perfect). With n=45 children, the estimate is wildly imprecise.

Should you recommend this vaccine? The point estimate says yes. The CI says "maybe, but we really don't know how well it works." This is the classic paediatric problem: ethical constraints → small samples → wide CIs → uncertain conclusions. Understanding CI width tells you whether a paediatric study actually answered its question or just provided a suggestive guess.


Obstetrics

The scenario: "Induction at 39 weeks reduces cesarean rate: OR 0.84 (95% CI: 0.76 to 0.93)."

This CI is narrow and entirely below 1.0 → strong evidence of reduced cesarean risk. The upper bound (0.93) still means at least a 7% relative reduction.

Contrast with a smaller study: "OR 0.78 (95% CI: 0.45 to 1.35)." Same direction, similar point estimate, but the CI crosses 1.0 (includes both benefit and harm). This study tells you almost nothing — the true effect could be 55% reduction or 35% increase. The CI reveals what the p-value ("not significant") obscures: the study was too small to detect anything.
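A side note on mechanics: CIs for ratio measures like the OR are typically built on the log scale (Woolf's method, with SE of log OR = √(1/a + 1/b + 1/c + 1/d)) and then exponentiated, which is why they aren't symmetric around the point estimate. A sketch with made-up 2×2 counts, not the trial's data:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR and 95% CI from a 2x2 table (a, b = events/non-events in arm 1;
    c, d = events/non-events in arm 2), via the log-OR standard error."""
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # Woolf's method
    lo = math.exp(math.log(or_hat) - z * se_log_or)
    hi = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lo, hi

# Hypothetical counts: cesarean yes/no under induction vs expectant management
print(odds_ratio_ci(450, 2650, 520, 2580))  # ≈ (0.84, 0.73, 0.97)
```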


Psychiatry

The scenario: "SSRI vs placebo on HAM-D: mean difference 2.1 points (95% CI: 1.3 to 2.9, p<0.001)."

Highly significant. Very narrow CI. Looks impressive.

But: The minimum clinically important difference on the HAM-D is approximately 3 points. The ENTIRE CI (1.3 to 2.9) falls BELOW the threshold for clinical importance.

This is a statistically significant, clinically irrelevant result. The CI proves it — the drug works, but it doesn't work enough to matter. Without the CI, you'd see p<0.001 and think "extremely effective." With the CI, you see "extremely precise evidence of a trivially small effect."

This is why NICE (UK) and some FDA reviewers look at CIs relative to clinical importance thresholds, not just relative to zero.


Community Medicine / PSM

The scenario: "Prevalence of hypertension in rural Maharashtra: 23% (95% CI: 18% to 28%)."

For policy planning, the CI tells you: budget for somewhere between 18% and 28% prevalence. If you build a programme for exactly 23%, you might underserve (if true prevalence is 28%) or waste resources (if true prevalence is 18%).

A smaller survey: "Prevalence: 23% (95% CI: 8% to 38%)." Same point estimate, useless CI. The true prevalence could be anywhere from 8% to 38%. You cannot plan anything from this.

The CI is the difference between a survey that informs policy and a survey that wastes money.


Orthopaedics

The scenario: Implant A vs Implant B: "Mean Harris Hip Score difference: 4 points (95% CI: -1 to 9)."

The CI crosses zero → not statistically significant. But the upper bound (9 points) exceeds the MCID of 8 points for Harris Hip Score. A clinically meaningful difference CANNOT be ruled out.

If the surgeon reads only the p-value ("not significant"), they dismiss Implant A. If they read the CI, they see: "We can't prove Implant A is better, but we also can't prove it isn't. The study was too small to tell." That's a very different conclusion — and might justify a larger trial instead of abandoning the implant.


Radiology / Pathology

The scenario: "Sensitivity of new MRI protocol for ACL tears: 92% (95% CI: 78% to 98%)."

The point estimate (92%) looks excellent. But the lower bound (78%) means the true sensitivity could be as low as 78% — which means 1 in 5 ACL tears missed. On a small sample (say, 35 patients with ACL tears), the CI is wide enough to include both "excellent" and "barely adequate."

A radiologist who reports "92% sensitivity" without the CI is giving false precision. The CI reveals that the true performance is uncertain and could be clinically inadequate.


CI vs p-Value — The Head-to-Head

| Feature | p-value | 95% CI |
| --- | --- | --- |
| What it tells you | Probability of data (or more extreme) if H0 true | Range of plausible values for the true effect |
| Direction of effect | Not directly (need to check estimate separately) | Yes — visible from the range |
| Magnitude of effect | No | Yes — the point estimate and bounds |
| Precision | No | Yes — width of CI = precision |
| Clinical significance | No — only statistical | Yes — compare bounds to MCID |
| Sample size adequacy | Hidden | Visible — wide CI = underpowered |
| Dichotomisation | Yes/No (significant or not) | Continuous — degrees of certainty |

So the CI gives you everything the p-value gives you, PLUS the direction, magnitude, precision, and clinical relevance. The p-value is a subset of the information contained in the CI. This is why every major journal (NEJM, Lancet, BMJ, JAMA) and every regulatory body (FDA, EMA) requires CIs alongside or instead of p-values.


The 6 Ways Not Knowing CI Destroys You

1. You worship p < 0.05 and ignore what the CI is screaming

A study with p=0.04 and 95% CI: 0.1 to 25.0 mmHg is not a "positive study." It's a noisy mess that happened to cross a threshold. The CI tells you the effect could be anywhere from trivial (0.1) to enormous (25.0). You know nothing useful.

2. You dismiss non-significant results that have narrow CIs

"95% CI: -1 to 2 mmHg, p=0.48." This is NOT a failure. This is strong evidence that the drug has little to no effect (the entire CI is clinically tiny). This is a useful result — it rules things OUT. But if you only read "p=0.48 → negative study," you miss the message.

3. You can't evaluate non-inferiority or equivalence trials

These trials are ENTIRELY decided by CI position relative to a margin. No CI understanding → you literally cannot read the primary result of a non-inferiority trial. And non-inferiority designs are the fastest-growing trial type in medicine.

4. You confuse precision with accuracy

A narrow CI means the estimate is precise (low uncertainty). It does NOT mean the estimate is accurate (close to truth). A biased study can produce a narrow CI that's precisely wrong — like a miscalibrated thermometer that consistently reads 2°C too high with very little variation. The CI captures random error, not systematic error (bias).

5. You can't counsel patients about treatment uncertainty

Patient: "Doctor, will this drug help me?"

Bad answer: "The study was statistically significant, so yes."

Better answer: "The study suggests the drug lowers your BP by about 8 points, but it could be as little as 2 or as much as 14. Even at the low end, some benefit is likely."

The CI gives you language for communicating uncertainty honestly. Without it, you're either falsely certain or unhelpfully vague.

6. You can't design your thesis sample size

The sample size formula for a CI-based approach is:

n = (Z × SD / E)²

Where E = desired CI half-width (margin of error). If you want a CI no wider than ±5 mmHg for BP with SD = 15, you need n = (1.96 × 15 / 5)² ≈ 34.6, which rounds up to 35.

If you don't understand what a CI width represents, you can't use this formula, and your thesis sample size is either unjustified or copy-pasted from another study without understanding.
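As a sanity check, here's that calculation in Python (the helper name is mine):

```python
import math

def n_for_ci_halfwidth(sd, half_width, z=1.96):
    """Smallest n giving a 95% CI no wider than ±half_width: n = (z·SD/E)²."""
    return math.ceil((z * sd / half_width) ** 2)

print(n_for_ci_halfwidth(sd=15, half_width=5))  # 35 patients
```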


The One Thing to Remember

A p-value is a verdict: guilty or not guilty. A confidence interval is the case file — it shows you all the plausible truths, how certain you are, and how much room there is for doubt.

Every time you see a point estimate in a paper, your first question should be: "What's the confidence interval?" Because the point estimate tells you the best guess. The CI tells you how good that guess actually is.

A mean of 12 with a CI of 10-14 is a solid finding. A mean of 12 with a CI of -5 to 29 is a shrug dressed up as a number. The point estimate is identical. The CI separates knowledge from noise.

Neyman called it "confidence" because he wanted you to trust the method, not the specific result. But the practical lesson is simpler: the CI is the most honest number in the paper. It doesn't hide uncertainty behind a binary yes/no. It shows you exactly how much you know and how much you don't.

Read the interval. Always.