What's Worse — Convicting an Innocent Man, or Letting a Killer Walk Free?
The Problem First
You're a medicine resident. A patient comes in with chest pain. You order a troponin.
- Troponin is positive. You admit, start heparin, call cardiology. But the patient actually has myocarditis, not MI. You just subjected a 28-year-old to unnecessary anticoagulation and a needless trip to the cath lab. You raised a false alarm.
- Troponin is negative. You send the patient home with antacids. But the patient is actually having an NSTEMI with a troponin that hasn't peaked yet. They come back in 6 hours in cardiogenic shock. You missed the real thing.
Both are errors. But they're fundamentally different errors. And the entire machinery of clinical trials, drug approvals, diagnostic testing, and screening programmes is built on understanding the difference between them.
Before the Jargon — The Courtroom Analogy
A criminal trial is the perfect model. The court assumes:
The defendant is innocent until proven guilty.
This is the null hypothesis — the default assumption. Nothing is happening. The drug doesn't work. The patient is healthy. The defendant is innocent.
But wait — why is it called a "null hypothesis"? Let's crack it open.
Term Deconstruction: Null Hypothesis (H0)
Word Surgery
Null (Latin nullus = "not any, none" → from ne- "not" + ullus "any") + Hypothesis (Greek hypo- "under" + thesis "placing, proposition")
Literal meaning → "a proposition of nothingness" — the placed-under assumption that nothing is happening
Why This Name?
Ronald Fisher formalised it in the 1920s-30s. He needed a starting position for statistical tests — the assumption you try to disprove. He called it "null" because it proposes that the effect is nil, zero, null. The treatment has NULL effect. There's NO difference. Nothing is going on. The word hypothesis means something "placed under" as a foundation — so the null hypothesis is the foundation of nothingness that you try to demolish with data.
The "Aha" Bridge
So "null hypothesis" literally = the proposition that nothing exists. It's the "boring answer." No drug effect. No difference. No correlation. Pure noise. Your entire study is an attempt to prove this boring answer wrong.
Naming Family
Null (nothing), nullify (to make nothing), annul (to reduce to nothing). Hypothesis: thesis (a proposition), antithesis (opposite proposition), synthesis (putting propositions together). Also: H0 (the subscript zero = null/nothing).
Term Deconstruction: Alternative Hypothesis (H1 or Ha)
Word Surgery
Alternat- (Latin alternare = "to do one thing then another" → from alter = "the other of two") + -ive (tending toward)
Literal meaning → "the other option" — the hypothesis that is the other of two possibilities
Why This Name?
Jerzy Neyman and Egon Pearson introduced this in the 1930s as the counterpart to Fisher's null. Fisher only had the null — you either reject it or don't. Neyman-Pearson said: you need to specify what the alternative is. If the drug doesn't have null effect, what effect DOES it have? The "alternative" is literally "the other one" — the hypothesis that something real is happening.
The "Aha" Bridge
So "alternative hypothesis" literally = the other proposition. If the null says "nothing is happening," the alternative says "something IS happening." The entire trial is a contest between these two. Your data picks the winner.
Naming Family
Alter (the other), alternative (the other option), alter ego (the other self), altercation (a dispute between two sides). One-sided alternative (effect in one direction) vs. two-sided (effect in either direction).
Now two things can go wrong:
| | Reality: Innocent | Reality: Guilty |
|---|---|---|
| Verdict: Guilty | WRONG — Convicted an innocent person | CORRECT |
| Verdict: Not Guilty | CORRECT | WRONG — Let a criminal go free |
- Convicting an innocent person = Type I Error (False Positive) = You said something is there when it isn't
- Letting a guilty person go free = Type II Error (False Negative) = You said nothing is there when something is
Term Deconstruction: Type I Error and Type II Error
Type I Error
Word Surgery
Type (Greek typos = "a blow, impression, model") + I (first) + Error (Latin errare = "to wander, stray")
Literal meaning → "the first kind of wandering from truth"
Why This Name?
Neyman and Pearson (1928-1933) needed to distinguish two fundamentally different ways a statistical test could go wrong. They simply numbered them. Type I = the FIRST kind of error = rejecting the null when it's actually true (false alarm). Type II = the SECOND kind = failing to reject the null when it's actually false (missed signal). There's no deep symbolism in "I" and "II" — it's just first and second.
The "Aha" Bridge
So "Type I error" = the first kind of mistake = you see a signal that isn't there. Think of it as being too eager. The alarm goes off when there's no fire. You convict an innocent man. You approve a useless drug. The 'I' in Type I = "I found something!" (but you didn't).
Naming Family
False positive (same thing, different name), alpha error (same thing, named by its probability α). Mnemonic: I = false Inclusion, Impostor signal.
Type II Error
Word Surgery
Same roots as above. Type II = the second kind of wandering from truth.
Why This Name?
Same Neyman-Pearson framework. The second way you can be wrong: the null hypothesis IS false (there IS a real effect), but your test fails to detect it. You miss what's actually there.
The "Aha" Bridge
So "Type II error" = the second kind of mistake = you miss a signal that IS there. Think of it as being too lazy/blind. The fire is real but the alarm stays silent. The criminal walks free. An effective drug gets rejected. The 'II' = "I Ignored It."
Naming Family
False negative (same thing), beta error (named by its probability β), miss. Mnemonic: II = failing II see — the signal was there, and you missed it.
Now the Statistical Terms
| | Reality: No Effect (H0 true) | Reality: Real Effect (H1 true) |
|---|---|---|
| Study says: Effect exists (reject H0) | Type I Error (α) — false positive | Correct (True Positive) |
| Study says: No effect (fail to reject H0) | Correct (True Negative) | Type II Error (β) — false negative |
The Numbers You Must Know
| Symbol | What It Is | Conventional Value | Plain English |
|---|---|---|---|
| α (alpha) | Probability of Type I error | 0.05 (5%) | "I'll accept a 5% chance of claiming something works when it doesn't" |
| β (beta) | Probability of Type II error | 0.20 (20%) | "I'll accept a 20% chance of missing a real effect" |
| 1 - β | Power | 0.80 (80%) | "I have an 80% chance of detecting a real effect if it exists" |
| p-value | Probability of getting this result (or more extreme) if H0 is true | < 0.05 to reject H0 | "If the drug truly doesn't work, what's the chance I'd see results this good by luck?" |
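The quantities in this table can be checked by simulation. A minimal sketch (the effect size 0.35 and n = 64 are illustrative choices, not from any trial): run many simulated studies through a two-sided z-test and count rejections. When the null is true, the rejection rate converges to α; when a real effect exists, it converges to power (1 − β).

```python
import math
import random

def rejection_rate(true_mean, n=64, sims=20000, seed=1):
    """Fraction of simulated studies (samples of size n from a normal
    distribution with sd 1) that reject H0: mean = 0 at two-sided
    alpha = 0.05, using a z-test with known sigma."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% critical value
    rejections = 0
    for _ in range(sims):
        sample_mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / sims

alpha_hat = rejection_rate(0.0)   # H0 true -> this is the Type I rate, ~0.05
power_hat = rejection_rate(0.35)  # H1 true -> this is power, ~0.80 (beta ~0.20)
print(round(alpha_hat, 3), round(power_hat, 3))
```

The two numbers fall out of the same machine: feed it a null world and you measure α; feed it an effect and you measure power.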
Let's deconstruct each one.
Term Deconstruction: Alpha (α)
Word Surgery
Alpha (Greek alpha = the first letter of the Greek alphabet, from Phoenician aleph = "ox")
Literal meaning → "the first" — it was the first error type defined
Why This Name?
When Neyman and Pearson formalised error rates, they assigned Greek letters in order. Alpha (α) = the probability of the first kind of error (Type I). Beta (β) = the probability of the second kind (Type II). That's it. First letter for first error. Second letter for second error. No deeper mystery.
The "Aha" Bridge
So "alpha" = first letter = probability of the FIRST error = your false alarm rate. When someone says "α = 0.05," they mean: "I'm willing to accept that 5 out of 100 times, I might raise a false alarm." It's the price you pay for being willing to declare something significant.
Naming Family
Alpha level, significance level (same thing — the threshold for "significant"), alpha error (= Type I error). Often confused with p-value (they're related but NOT the same).
Term Deconstruction: Beta (β)
Word Surgery
Beta (Greek = second letter, from Phoenician beth = "house")
Literal meaning → "the second"
Why This Name?
Second letter for the second error type. β = P(Type II error) = probability of missing a real effect.
The "Aha" Bridge
So "beta" = second letter = probability of the SECOND error = your miss rate. When β = 0.20, you'll miss a real effect 20% of the time. The complement (1 - β) is what you actually want — power.
Term Deconstruction: Power (1 - β)
Word Surgery
Power (Old French povoir = "to be able" → from Vulgar Latin potere = "to be able, capable")
Literal meaning → "the ability/capability to do something"
Why This Name?
Neyman and Pearson chose it because power measures a test's capability to detect a real effect. A powerful test = one that can find what's actually there. A weak test = one that misses real effects. The word is intuitive: more power = better ability to catch the truth.
The "Aha" Bridge
So "power" literally = the ability to detect. A study with 80% power has an 80% ability to find a real effect if one exists. A study with 30% power is like a security camera with a foggy lens — the burglar might be right there, but the camera can't see well enough to catch them.
Naming Family
Statistical power, power analysis (calculating needed sample size), power curve (power as a function of effect size), underpowered (too weak to detect), overpowered (so large it detects trivial effects).
Term Deconstruction: p-value
Word Surgery
p (abbreviation for "probability") + value (Latin valere = "to be strong/worth")
Literal meaning → "the probability value" — a number expressing how probable something is
Why This Name?
Ronald Fisher popularised the p-value in the 1920s as a way to summarise the evidence against the null hypothesis in a single number. The "p" stands for probability. Specifically: the probability of observing data this extreme (or more extreme) IF the null hypothesis were true. Fisher never intended it to be a rigid cutoff — that came from Neyman-Pearson's decision-theoretic framework.
The "Aha" Bridge
So "p-value" literally = the probability-worth of your data under the assumption of nothingness. Low p-value = your data is very unlikely if nothing is happening → so maybe something IS happening. It's not the probability the drug works. It's not the probability you're right. It's the probability the data would look like this IF the drug did nothing. A subtle but critical distinction.
Naming Family
p (probability), significance testing (using p to declare significance), Fisher's exact test (one way to calculate p), Bayesian posterior probability (a very different beast often confused with p-value).
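That "subtle but critical distinction" can be made concrete with a toy simulation (the observed mean of 0.4, n = 30, and sd of 1 are invented for illustration): build the world where the null is true, then ask how often that world produces data at least as extreme as yours.

```python
import random

def p_value_under_null(observed_mean, n=30, sims=50000, seed=7):
    """Monte Carlo p-value: the fraction of null-world studies (true
    mean 0, sd 1, sample size n) whose sample mean is at least as
    extreme as the one actually observed."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(sims):
        null_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if abs(null_mean) >= abs(observed_mean):
            at_least_as_extreme += 1
    return at_least_as_extreme / sims

p = p_value_under_null(0.4)  # ~0.03: rare under H0, so maybe H0 is wrong
print(p)
```

Note what this number is: P(data this extreme | H0 true) — not the probability that H0 is true, and not the probability the drug works.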
Term Deconstruction: Significance
Word Surgery
Signi- (Latin signum = "sign, signal, mark") + -ficance (from facere = "to make")
Literal meaning → "the quality of making a sign/signal" — being sign-making, signal-producing
Why This Name?
Fisher used "significant" to mean: the data is making a signal. It's sign-ificant — it produces a sign that something might be real. When p < 0.05, the result is "statistically significant" = the data is producing a signal strong enough to take seriously. But NOTE: Fisher meant this as a continuous measure of evidence strength, not a binary yes/no.
The "Aha" Bridge
So "significant" literally = making a signal. "Statistically significant" does NOT mean "important" or "clinically meaningful." It means the data is emitting a signal above the noise floor you set (α). A drug that lowers BP by 0.5 mmHg can be "statistically significant" with n=100,000 while being clinically useless.
Naming Family
Signal (same root), sign (a mark), insignificant (not making a signal), clinical significance (actually matters to patients — a completely different concept from statistical significance).
Notice the asymmetry. We tolerate a 20% miss rate (β) but only a 5% false alarm rate (α). Why? Because in medicine, approving a useless/harmful drug (Type I) is considered worse than failing to approve a useful one (Type II).
The system is conservative by design. It would rather let 100 effective drugs fail in trials than approve 1 harmful drug.
The Analogy That Makes It Stick
The Fire Alarm
Your hospital has a fire alarm system.
- Type I Error (False Alarm): Alarm goes off. Everyone evacuates. ICU patients get shifted. Surgeries get cancelled. Turns out — someone burned toast in the pantry. Costly, disruptive, but nobody dies from the alarm itself.
- Type II Error (Missed Fire): Real fire in the electrical room. Alarm doesn't go off. Smoke spreads. Patients on ventilators are trapped. People die because the system stayed silent.
Now here's the design question: Do you set the alarm to be more sensitive (catches every whiff of smoke, but triggers on toast) or more specific (only triggers on real fires, but might miss a slow-burning one)?
That's the α/β tradeoff. Every time you set α lower (fewer false alarms), you push β higher (more missed signals) — unless you increase your sample size (buy a better alarm system).
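The tradeoff can be read straight off the normal-approximation power formula for a two-sided z-test — power ≈ Φ(effect·√n − z₁₋α/₂). A sketch (the effect size 0.35 and the sample sizes are illustrative):

```python
import math
from statistics import NormalDist

def power(alpha, effect, n):
    """Approximate power of a two-sided one-sample z-test when the true
    standardized mean is `effect` and the sample size is n. Uses the
    normal approximation and ignores the negligible far-tail term."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect * math.sqrt(n)
    return 1 - NormalDist().cdf(z_crit - shift)

print(round(power(0.05, 0.35, 64), 2))   # ~0.80 at the conventional alpha
print(round(power(0.01, 0.35, 64), 2))   # lower alpha -> power drops (beta rises)
print(round(power(0.01, 0.35, 100), 2))  # a bigger sample buys the power back
```

Tightening α pushes the critical value outward, so fewer real effects clear the bar — unless √n grows enough to shift the signal past it.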
The Regulatory Dimension — How FDA Thinks About This
Type I Error = FDA's Nightmare
The FDA's primary mandate is protecting the public from harm. Approving a drug that doesn't work (or worse, harms) is their Type I error.
This is why:
- α is set at 0.025 (one-sided) or 0.05 (two-sided) for pivotal trials — non-negotiable
- Two adequate and well-controlled trials are typically required — the chance that both are false positives drops to α × α = 0.05 × 0.05 = 0.0025, a 0.25% combined risk
- Multiplicity adjustment is mandatory — if you test 10 endpoints, the chance of at least one false positive at α=0.05 is 1 - (0.95)¹⁰ ≈ 40%. FDA demands Bonferroni, Hochberg, gatekeeping, or other corrections
- Pre-specification of primary endpoint and analysis plan is required — no fishing for significant results after the fact
Let's deconstruct the multiplicity terms:
Term Deconstruction: Multiplicity
Word Surgery
Multi- (Latin multus = "many") + -plic- (Latin plicare = "to fold") + -ity (state of)
Literal meaning → "the state of being many-folded" → having many layers/tests
Why This Name?
When you run multiple statistical tests, each one "folds in" another chance of a false positive. The chances of escaping a false alarm multiply across independent tests — (1 − α) per test — so the probability of at least one false positive compounds well beyond what α alone would suggest. Statisticians call this the "multiplicity problem" because the many-foldedness of testing inflates the overall false alarm rate.
The "Aha" Bridge
So "multiplicity" literally = many folds. Each test you run is another fold of the napkin. By the time you've folded 20 times, the shape bears no resemblance to what you started with. Your overall false alarm rate has inflated from 0.05 per test to 0.64 across the family. The "significance" of any individual result has been diluted by the crowd.
Term Deconstruction: Bonferroni (Correction)
Word Surgery
Bonferroni — named after Carlo Emilio Bonferroni (1892-1960), Italian mathematician
Not a word to dissect — it's a surname. But the method is simple: divide α by the number of tests.
Why This Name?
Bonferroni published his inequality in 1936 (building on George Boole's work). The Bonferroni correction says: if you're running m tests and want the overall false alarm rate to stay at α, then each individual test must use α/m as its threshold. 20 tests at α=0.05 → each test uses 0.05/20 = 0.0025. Brutally conservative but mathematically airtight.
The "Aha" Bridge
So "Bonferroni correction" = dividing your significance budget equally across all tests. Think of α as a ₹100 budget. If you're buying 20 items (tests), each item can only cost ₹5. Overspend on one, and the whole budget is blown. The criticism: it's so conservative that it kills your power to detect real effects. It's like setting the fire alarm so high that nothing triggers it.
Naming Family
Bonferroni inequality (the mathematical principle), Holm-Bonferroni (a stepwise improvement), Sidak correction (similar but slightly less conservative). Also: Hochberg, Benjamini-Hochberg (FDR-based — controls false discovery rate instead of family-wise error rate).
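Both the inflation and the budget-splitting fix above are one line of arithmetic each (assuming independent tests):

```python
def fwer(per_test_alpha, m):
    """Family-wise error rate: P(at least one false positive) across
    m independent tests, each run at per_test_alpha."""
    return 1 - (1 - per_test_alpha) ** m

print(round(fwer(0.05, 20), 2))       # uncorrected: the 0.64 quoted above
print(round(fwer(0.05 / 20, 20), 3))  # Bonferroni (alpha/m): back to ~0.05
```

Dividing the α budget by m guarantees the family-wise rate stays at or below α — at the cost of making each individual test far harder to pass.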
Term Deconstruction: Family-Wise Error Rate (FWER)
Word Surgery
Family (Latin familia = "household") + Wise (Old English = "in the manner of") + Error Rate
Literal meaning → "error rate in the manner of the whole family of tests"
Why This Name?
A "family" of tests = all the hypotheses tested in one study (they belong to the same household). The FWER is the probability that at least ONE test in the family produces a false positive. When statisticians say "family-wise," they mean: we're measuring the error rate for the WHOLE family, not each individual test.
The "Aha" Bridge
So "family-wise error rate" literally = the chance that at least one member of the test-family commits a false alarm. If you have 5 kids (tests) and each has a 5% chance of breaking a window (false positive), the chance that AT LEAST ONE breaks a window is much higher than 5%. FWER controls the probability that the family as a whole behaves.
Naming Family
FWER (family-wise), FDR (false discovery rate — a less strict alternative that controls the PROPORTION of false positives among all positives), per-comparison error rate (each test alone, ignoring the family).
Real example — Alzheimer's: Aducanumab (Aduhelm) had one positive trial and one failed trial. Same drug, same design. FDA approved it anyway under accelerated pathway. The statistical community was furious — the Type I error protection of "two positive trials" was abandoned. The drug was later withdrawn from market.
Type II Error = The Patient Advocate's Nightmare
Failing to approve a drug that actually works means patients die waiting.
This is where "regulatory flexibility" enters:
- Accelerated Approval allows drugs for serious conditions based on surrogate endpoints (faster answer, but higher Type II risk if surrogates don't predict clinical outcomes)
- Breakthrough Therapy Designation provides intensive FDA guidance and rolling review
- Single pivotal trial may suffice for rare diseases (requiring two adequately powered trials in a disease with 500 patients worldwide is simply not feasible — you cannot recruit enough patients to reach 80% power even once)
- Adaptive trial designs allow sample size re-estimation mid-trial if the effect is smaller than expected (reducing β without inflating α)
Real example — Rare diseases: Elevidys (DMD gene therapy) failed its primary endpoint in the EMBARK trial. But FDA granted traditional approval based on the totality of evidence — secondary endpoints, biomarker data, mechanism of action, and devastating unmet need. They accepted higher Type II error risk on the primary in exchange for the totality picture.
Real example — The other side: Relyvrio (ALS) got accelerated approval based on a small Phase 2. The confirmatory Phase 3 PHOENIX trial failed everything. Drug withdrawn. The initial approval was arguably a Type I error that the confirmatory trial corrected.
The ICH E9 Framework
The International Council for Harmonisation sets the global standard:
- α ≤ 0.05 (two-sided) for confirmatory trials
- Power ≥ 80% (β ≤ 0.20) — though 90% power is increasingly expected for pivotal trials
- Sample size must be justified based on clinically meaningful effect size, not statistical convenience
- Interim analyses must use alpha-spending functions (O'Brien-Fleming, Lan-DeMets) to control cumulative Type I error across multiple looks
Branch-by-Branch — Where This Bites You
General Medicine
Type I trap: A paper says "Vitamin D supplementation reduces all-cause mortality (p=0.04)." But they tested Vitamin D against 15 different outcomes. With 15 tests at α=0.05, the probability of at least one false positive is 54%. That "significant" result is probably a Type I error from multiple testing. You start prescribing Vitamin D to every patient based on statistical noise.
Type II trap: A small study (n=60) says "no significant difference between Drug A and Drug B for hypertension." But with n=60, the study had only 30% power to detect a 5 mmHg difference. It wasn't powered to find a real difference — absence of evidence got mistaken for evidence of absence. You dismiss Drug A when it might actually be superior.
Surgery
Type I trap: "Robot-assisted surgery has significantly fewer complications than open surgery (p=0.03)." The study compared 47 outcomes. After multiplicity correction, nothing is significant. But the abstract only reports the one that crossed p<0.05. You invest ₹3 crore in a robot based on a false positive.
Type II trap: "No significant difference in recurrence rates between mesh and suture repair for hernia (p=0.08)." n=80. Underpowered. The 8% vs 15% recurrence difference is clinically massive but statistically "non-significant" because the study was too small. You keep doing suture repair because the study "showed no difference."
Paediatrics
Type I trap: Paediatric trials are small. Small trials have noisy results. Noisy results sometimes cross p=0.05 by chance. A "significant improvement" in a trial of 30 children carries a much higher real-world false-positive risk than the nominal 5% suggests, because small samples produce unstable, exaggerated estimates.
Type II trap: Conversely, a trial of 25 children "failing to show benefit" of a potentially life-saving therapy may simply have been too small. Ethical constraints prevent larger paediatric trials, but the consequence is that effective paediatric treatments get abandoned because we couldn't prove they work with tiny samples.
Obstetrics
Type I trap: Subgroup analysis. The main trial shows no benefit of intervention X. But in the subgroup of "nulliparous women aged 25-30 with BMI < 25," p=0.03. This is almost certainly a Type I error — when you slice data into enough subgroups, something will be significant by chance. Yet this subgroup result gets cited in guidelines.
Type II trap: A trial of magnesium sulphate for fetal neuroprotection had borderline results for years across small trials. Each individual trial was underpowered (Type II error). It took a meta-analysis combining all of them to finally confirm the benefit. Years of missed neuroprotection because individual trials kept "failing."
Psychiatry
Type I trap: Psychiatric rating scales (HAM-D, PANSS) have many subscales. A drug fails the total score but is "significant" on the "sleep quality" subscale. This gets marketed as "improves sleep in depression." That's a Type I error dressed up as a clinical finding.
Type II trap: Psychiatric drug effects are small (Cohen's d = 0.3-0.5). To detect these reliably, you need n=300+. Many trials recruit 100 and "fail" — not because the drug doesn't work, but because the study was underpowered. Real treatments get killed by inadequate sample sizes.
Community Medicine / PSM
Type I trap: A district health survey finds "significant association between mobile phone use and headaches (p=0.04)" in a dataset of 50,000 people with 200 variables. With that many comparisons, dozens of spurious associations will be "significant." A public health scare gets born from a Type I error.
Type II trap: A village-level intervention to reduce malaria (bed nets + IRS) is tested in 4 villages (cluster RCT). n=4 clusters. Power ≈ 20%. The study "fails to show benefit." The intervention gets defunded. But the study was designed to fail — 4 clusters can't detect anything. Real public health benefit gets buried by underpowered design.
Orthopaedics
Type I trap: "New implant shows significantly better functional scores at 6 months (p=0.04)." But follow up was at 6 months, 1 year, 2 years, and 5 years — four time points. No multiplicity correction. The 6-month result was the only one that crossed p<0.05. You adopt an expensive implant based on what's likely random noise at one cherry-picked timepoint.
Type II trap: A study comparing two fixation methods for distal radius fractures (n=40 per group) finds "no significant difference." But 40 per group gives you ~35% power for the clinically meaningful difference you care about. The study was doomed to find "no difference" regardless of reality. A truly better fixation method gets dismissed.
The 5 Ways Not Knowing This Destroys You
1. You worship p < 0.05 without asking "how many tests did they run?"
If a study tested 20 outcomes, the chance of at least one being "significant" at α=0.05 is:
1 - (0.95)²⁰ = 64%
That's worse than a coin flip. Without multiplicity correction, "statistical significance" in a multi-outcome study is nearly meaningless.
2. You treat "no significant difference" as "no difference"
This is the single most common statistical error in medicine. A study that fails to reject H0 has NOT proven the groups are equal. It may simply have been too small (underpowered = high β = high Type II error).
"Absence of evidence is not evidence of absence." — Carl Sagan (and every biostatistician ever)
3. You can't evaluate sample size justifications
Every paper should state: "We calculated that n=X patients per group would provide 80% power to detect a difference of Y at α=0.05."
- If the paper doesn't have this → they didn't plan the study properly.
- If the detected effect is smaller than Y → the study was underpowered for what they found.
- If the paper recruited far fewer than n=X → the power is below 80% and the Type II error risk is unacceptable.
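The sentence every paper should contain maps onto a standard formula. A sketch using the normal approximation for a two-sample comparison of means (the effect sizes below are illustrative): n per group ≈ 2·((z₁₋α/₂ + z_power)/d)².

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Sample size per group for a two-sample comparison of means with
    standardized effect size d (normal approximation, two-sided test)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

print(n_per_group(0.5))              # medium effect: ~63 per group
print(n_per_group(0.3))              # small effect (typical in psychiatry): ~175 per group
print(n_per_group(0.3, power=0.90))  # demanding 90% power costs still more
```

Halving the effect size roughly quadruples the required sample — which is exactly why small studies chasing small effects are doomed to Type II errors.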
4. You don't understand why regulatory bodies demand two positive trials
Two independent trials, each at α=0.05, give a combined Type I error of 0.05 × 0.05 = 0.0025. That's a 1-in-400 chance of approving a drug that doesn't work based on two fluke results.
One trial at α=0.05? That's 1-in-20. Unacceptable for a drug that millions will take.
This is why the FDA requires two adequate and well-controlled trials — it's not bureaucratic redundancy, it's Type I error multiplication.
5. You can't participate in thesis design or journal club meaningfully
When your professor asks "why did you choose n=50?" and you say "because that's what the previous study used" — you've just admitted you don't understand power analysis, Type II error, or effect size estimation.
When a journal club paper reports p=0.07 and the presenter says "the study was negative," you should be the one asking: "What was the power? What effect size was it designed to detect? Is this a true negative or a Type II error from underpowering?"
The One Thing to Remember
Every time you read a trial result, you're standing at the courtroom analogy:
- p < 0.05 → "Guilty verdict." But ask: could this be a wrongful conviction? (How many tests? Pre-specified? Adequate multiplicity control?)
- p ≥ 0.05 → "Not guilty verdict." But ask: could the criminal have walked free? (Was the sample large enough? Was power adequate? Was the effect size realistic?)
A p-value is not a verdict. It's the strength of the prosecution's case. And even strong cases can convict the innocent, while weak cases can free the guilty.
The resident who understands this reads papers differently. Designs theses differently. Treats patients differently. Because they know that every "significant" result might be a false alarm, and every "negative" study might be a missed signal.