Why Do You Start Every Experiment by Assuming Your Drug Doesn't Work?
The Problem First
You're a pulmonology resident. You've spent two years collecting data for your thesis: a new inhaled corticosteroid vs the standard one for moderate asthma. You measured FEV1 improvement at 12 weeks. Your drug showed 320 mL improvement. The standard showed 280 mL. A 40 mL difference.
You're excited. Your drug is better.
Your biostatistics professor says: "Before you celebrate, you need to prove that this 40 mL difference isn't just random noise."
You say: "But I can SEE it's different. 320 is more than 280."
She says: "I can SEE shapes in clouds. That doesn't make them real. The statistical framework starts from the assumption that your drug does NOTHING different. You have to provide enough evidence to overturn that assumption."
You have to prove your drug works by first assuming it doesn't. That assumption — the assumption of nothing happening — is called the null hypothesis. And the entire logic of statistical testing, drug approval, and evidence-based medicine is built on this strange, counterintuitive starting point.
Why would anyone design a system that starts by assuming the thing you're testing doesn't work?
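To see what the professor is demanding, here is a minimal sketch of the kind of test she has in mind: a two-sample comparison of the 320 mL and 280 mL means. The standard deviation (150 mL) and group size (60 per arm) are invented for illustration; the scenario above doesn't specify them.

```python
import math

def two_sample_z_test(mean1, mean2, sd, n_per_group):
    """Normal-approximation two-sample test of H0: mean1 == mean2.
    Returns the z statistic and the two-sided p-value."""
    se = sd * math.sqrt(2.0 / n_per_group)          # SE of the difference in means
    z = (mean1 - mean2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))
    return z, p_two_sided

# Hypothetical numbers: SD = 150 mL, n = 60 per arm (NOT from the scenario)
z, p = two_sample_z_test(320, 280, sd=150, n_per_group=60)
```

With these made-up numbers the p-value comes out around 0.14: a 40 mL difference would appear by chance well over one time in ten even if the two drugs were identical. "I can SEE it's different" is not evidence.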
Word Surgery: "Null Hypothesis"
"Null"
Root: Latin nullus = "not any" / "none" / "zero" → From ne- (not) + ullus (any)
Literal meaning: "nothing" / "zero" / "no effect"
In statistics: The assumption that there is NO effect, NO difference, NO association. The treatment does nothing. The groups are the same. The correlation is zero. The drug is a fancy placebo.
Why "null" and not "zero" or "nothing"? Because "null" has a specific connotation of deliberate negation — it's not that we don't know whether there's an effect. We're actively ASSUMING there isn't one. "Null" implies a formal declaration of nothingness, not just ignorance.
"Hypothesis"
Root: Greek hypo- (under, below) + thesis (a placing, a proposition) → hypothesis = "something placed underneath" / "a foundation placed below"
Literal meaning: "the proposition you build your test on top of"
Why "hypo-" (under)? Because a hypothesis is the FOUNDATION underneath your experiment. You set it down first, then build the test on top of it. You place it UNDER the evidence and see if the evidence crushes it or it holds up.
→ So "null hypothesis" literally = "the foundation of nothingness that you place underneath your experiment."
→ Aha: You build a floor of "nothing is happening." Then you pile evidence on top. If the evidence is heavy enough, the floor cracks. If the floor holds, you haven't proven anything is happening.
Naming Family
| Term | Symbol | What It Means | The Name Logic |
|---|---|---|---|
| Null Hypothesis | H0 | Nothing is happening | "Null" = zero effect. The default. |
| Alternative Hypothesis | H1 or Ha | Something IS happening | "Alternative" = the OTHER possibility, the one you actually believe |
| Research Hypothesis | — | What the researcher hopes to show | Often identical to H1, but conceptual not formal |
| One-sided (one-tailed) | H1: μ₁ > μ₂ | Effect is in a specific direction | "One side" of the distribution |
| Two-sided (two-tailed) | H1: μ₁ ≠ μ₂ | Effect could be in either direction | "Both sides" of the distribution |
The confusing part: The null hypothesis is usually what you DON'T believe. The alternative hypothesis is usually what you DO believe. You test the thing you don't believe in order to prove the thing you do believe. It's like proving you're not a criminal instead of proving you're a good person. The system is built on DISPROVING THE NEGATIVE, not on proving the positive.
Who Invented This? — The Neyman-Pearson-Fisher War
The null hypothesis has one of the most contentious origin stories in science. Three giants fought over it, and we ended up with a Frankenstein hybrid that none of them would fully endorse.
Ronald Fisher (1890-1962) — The First Formulation
Fisher introduced the concept of the null hypothesis in the 1920s and 1930s; the term itself first appears in his 1935 book The Design of Experiments. His framework was:
- State a null hypothesis (H0): "There is no effect"
- Calculate a test statistic from your data
- Find the p-value: the probability of getting data this extreme (or more extreme) IF H0 is true
- If the p-value is small enough, reject H0
Fisher's key idea: The null hypothesis is a straw man — you set it up to knock it down. You never "accept" H0. You either reject it or "fail to reject" it. The evidence either disproves H0 or is insufficient to disprove it.
Fisher did NOT use a fixed α threshold. He reported exact p-values and let the researcher judge: p=0.001 is stronger evidence against H0 than p=0.04. He considered α=0.05 "convenient" but not sacred.
Fisher did NOT use an alternative hypothesis. In his framework, you only specify H0. If you reject it, you conclude "something is happening" without formally specifying what that "something" is.
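Fisher's logic is easiest to see in the randomisation test he pioneered: make H0 literally true by shuffling the group labels, then count how often chance alone reproduces a difference as extreme as the observed one. A pure-Python sketch (the function name and data are illustrative):

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=42):
    """How often does randomly relabelling the observations (H0 made
    literal: the labels carry no information) produce a mean difference
    at least as extreme as the one actually observed?"""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return hits / n_perm
```

A small p-value means the shuffled (null) world rarely produces anything as extreme as the real data, which is exactly Fisher's notion of evidence against H0.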
Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980) — The Rival Framework
Neyman and Pearson (Egon was the son of Karl Pearson, who coined the term "standard deviation") developed a competing framework in the 1930s:
- State TWO hypotheses: H0 (null) AND H1 (alternative)
- Choose α (Type I error rate) BEFORE collecting data
- Choose β (Type II error rate) and calculate required sample size
- Collect data, compute test statistic
- If test statistic exceeds the critical value → reject H0 in favour of H1
- If not → fail to reject H0 (NOT "accept H0")
Neyman-Pearson's key difference: They introduced the ALTERNATIVE hypothesis and the concept of TYPE I and TYPE II errors. Fisher only had H0. Neyman-Pearson had both H0 and H1, and explicitly balanced the risk of wrongly rejecting H0 (Type I) against the risk of wrongly failing to reject it (Type II).
Neyman-Pearson also introduced POWER (1 - β) — the probability of correctly rejecting H0 when H1 is true. Fisher never used the concept of power.
The War
Fisher HATED Neyman-Pearson's framework. He called it "childish" and "horrifying." His objections:
- Fixed α is stupid. Fisher believed p-values should be interpreted in context, not compared against a rigid threshold. A p=0.049 should not be treated fundamentally differently from p=0.051.
- The alternative hypothesis is unnecessary. Fisher believed you should only specify what you're trying to disprove (H0), not what you're trying to prove (H1). Specifying H1 biases the experiment.
- Decision-making is not the goal of science. Fisher saw his framework as measuring EVIDENCE AGAINST H0. Neyman-Pearson saw theirs as a DECISION PROCEDURE (accept or reject). Fisher thought reducing science to binary decisions was reductive.
What We Actually Use — The Frankenstein Hybrid
Modern medical statistics is a bastardised hybrid of both frameworks that neither Fisher nor Neyman-Pearson would recognise:
| Feature | Fisher | Neyman-Pearson | What We Actually Do |
|---|---|---|---|
| Hypotheses | H0 only | H0 and H1 | H0 and H1 |
| α | Not fixed, p-value is continuous | Fixed before data collection | "Fixed" at 0.05 but also report exact p-values |
| p-value | Continuous measure of evidence | Compared to α, binary decision | Both — report p-value AND compare to α=0.05 |
| Power | Not used | Central to design | Used for sample size but often ignored in interpretation |
| Language | "The evidence against H0 is strong/weak" | "Reject H0 / Fail to reject H0" | "Statistically significant" (a phrase neither invented) |
Nobody intended the system we have. Fisher would be appalled that we use fixed α=0.05 as a universal threshold. Neyman would be appalled that we interpret p-values as evidence strength. Pearson would be appalled at the entire mess.
But it works well enough for regulatory decision-making. The hybrid gives us a framework for controlling error rates (Neyman-Pearson's contribution) while also quantifying evidence strength (Fisher's contribution). The philosophical inconsistency bothers statisticians. The practical utility serves medicine.
Why Is "Null" Confusing? — The Dictionary Collision
| Source | "Null" means... |
|---|---|
| Legal | "Null and void" = invalid, having no legal force |
| Computing | Null = empty, no value assigned, undefined |
| Everyday | "Null result" = a failed experiment |
| Statistics | Null = the assumption of ZERO effect (not "failed" or "invalid") |
The Three Confusions
Confusion 1: "Null" ≠ "failed"
When a study "fails to reject the null hypothesis," students think the study "failed." It didn't. It produced a valid result — the result is that there wasn't enough evidence against H0. That's information, not failure.
"We found no significant difference" is a result, not a failure. But the word "null" makes it FEEL like failure because "null" in everyday language means "nothing, void, worthless."
Confusion 2: "Fail to reject H0" ≠ "Accept H0"
The double negative is agonising. "Fail to reject" is not the same as "accept." Not being able to disprove innocence is not the same as proving innocence.
Why don't we just say "accept H0"? Because failure to find evidence against H0 doesn't mean H0 is true. It might mean:
- The effect is real but your sample was too small (underpowered)
- The effect is real but your measurement was too imprecise
- The effect is real but in a different direction than you tested (one-tailed vs two-tailed)
"Fail to reject" preserves the ambiguity. "Accept" falsely implies certainty.
Confusion 3: H0 is what you DON'T believe
In every other domain, you state what you believe and try to prove it. In court, the prosecution states their case and tries to prove it.
In statistics, you state what you DON'T believe (H0) and try to disprove it. It's like a prosecutor saying: "Let me assume the defendant is innocent, and then show you the evidence is so overwhelming that this assumption is untenable."
This is proof by contradiction — the same logic used in mathematical proofs. Assume the opposite of what you want to show. Demonstrate that the assumption leads to an absurd conclusion (p < 0.05 = "if H0 were true, getting data this extreme would be absurdly unlikely"). Therefore, the assumption must be wrong. Therefore, H0 is probably false.
It's rigorous. It's powerful. It's deeply unintuitive for anyone who hasn't done mathematical logic.
Why Start with "Nothing Is Happening"? — The Deep Reason
The Courtroom Analogy
The null hypothesis mirrors the legal principle of presumption of innocence:
| Legal System | Statistical System |
|---|---|
| Defendant is presumed innocent | Drug is presumed ineffective (H0) |
| Prosecution must prove guilt | Researcher must prove efficacy |
| Beyond reasonable doubt | p < 0.05 (or the chosen α) |
| Guilty verdict | Reject H0 |
| Not guilty verdict | Fail to reject H0 |
| Not guilty ≠ innocent | Fail to reject ≠ H0 is true |
Why presume innocence? Because the consequences of wrongly convicting an innocent person (Type I error) are considered worse than wrongly acquitting a guilty person (Type II error). The system is deliberately conservative — it would rather let 10 guilty people go free than convict 1 innocent person.
Why presume no drug effect? Because the consequences of approving an ineffective (or harmful) drug (Type I error) are considered worse than failing to approve an effective drug (Type II error). The system would rather reject 10 effective drugs than approve 1 harmful one.
This asymmetry is a moral choice, not a mathematical one. Society decided that protecting the public from harm is more important than maximising access to treatments. H0 embodies that choice.
The Parsimony Principle
There's a deeper philosophical reason: Occam's Razor.
"Do not multiply entities beyond necessity." The simplest explanation should be preferred until evidence demands a more complex one.
H0 is the simplest explanation: nothing is happening. The groups are the same. The treatment has no effect. The correlation is zero. Any observed difference is just random noise.
H1 is the more complex explanation: something IS happening. A real effect exists.
Starting with H0 means requiring evidence before accepting complexity. You don't get to claim your drug works just because your sample showed a difference. You have to show that the difference is too large to be plausibly explained by chance alone (H0).
The Regulatory Dimension
FDA and the Null Hypothesis — Structural Conservatism
The entire FDA approval process is built on the null hypothesis framework.
The sponsor (pharmaceutical company) must provide "substantial evidence of effectiveness" (Federal Food, Drug, and Cosmetic Act, Section 505). This means: the sponsor must disprove H0, not merely suggest H1.
1. Pre-specification of H0 and H1
ICH E9 requires that the null and alternative hypotheses be pre-specified in the Statistical Analysis Plan (SAP) before the study is unblinded.
Why? Because if you specify H0 AFTER seeing the data, you can tailor it to get a significant result (p-hacking). Pre-specification ensures that the hypothesis was not influenced by the data.
An illustrative example of this going wrong: A trial tests 5 different dose groups against placebo. After unblinding, only the 200 mg dose "works." The sponsor rewrites the SAP to specify H0 as "200 mg = placebo" and claims a pre-specified positive result. FDA catches this because the original SAP specified H0 as "any dose = placebo" with multiplicity correction.
2. The Two-Trial Rule
FDA traditionally requires TWO adequate and well-controlled pivotal trials, each rejecting H0 at α=0.05 (two-sided).
The mathematical logic: if each trial independently rejects H0 at α=0.05, the probability of BOTH being false positives (both wrongly rejecting H0) = 0.05 × 0.05 = 0.0025 (1 in 400).
One trial at α=0.05 → 1-in-20 chance of false positive. Two trials → 1-in-400 chance. That's the statistical logic behind the two-trial requirement.
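The arithmetic is trivial, but worth writing down:

```python
alpha = 0.05                      # per-trial Type I error rate
both_wrong = alpha * alpha        # both pivotal trials falsely reject H0
print(round(both_wrong, 4))       # 0.0025, i.e. a 1-in-400 false positive rate
```

This relies on the two trials being independent; shared biases (same flawed endpoint, same enriched population) can make the real joint error rate higher than 1 in 400.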
Exceptions exist: For rare diseases, serious conditions with unmet need, or when a single trial is extraordinarily persuasive (massive effect size, very low p-value, consistent secondary endpoints), FDA may accept one pivotal trial. But the DEFAULT is two — because the null hypothesis framework demands replication to protect against false positives.
3. One-Sided vs Two-Sided H1
Word Surgery: "One-Sided" and "Two-Sided"
Why these names? Because they describe which SIDE(S) of the distribution you're examining.
A two-sided test asks: "Is the drug different from placebo?" (could be better OR worse) → H0: Drug = Placebo. H1: Drug ≠ Placebo. → You reject H0 if the difference falls in EITHER tail (either side).
A one-sided test asks: "Is the drug better than placebo?" (only better, not worse) → H0: Drug ≤ Placebo. H1: Drug > Placebo. → You reject H0 only if the difference falls in ONE tail.
FDA's position: Almost always requires two-sided tests at α=0.05 (equivalent to one-sided at α=0.025). Why? Because a one-sided test ASSUMES the drug can't be harmful — which is medically dangerous. A drug might make things worse, and a one-sided test would never detect it.
The exception: Non-inferiority trials use one-sided testing by convention (you're only asking "is the new drug no worse than the standard?").
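The relationship between the two is pure bookkeeping: for an effect in the hypothesised direction, the one-sided p-value is half the two-sided one, which is exactly why a two-sided test at α=0.05 matches a one-sided test at α=0.025. A sketch using the normal distribution:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def p_two_sided(z):
    """Reject H0 for extreme values in EITHER tail."""
    return 2 * (1 - phi(abs(z)))

def p_one_sided_upper(z):
    """H1: drug > placebo, so only the upper tail counts."""
    return 1 - phi(z)

z = 1.96  # the familiar two-sided 5% critical value
```

At z = 1.96 the two-sided p-value is about 0.05 and the one-sided p-value about 0.025: same data, same statistic, but the one-sided test has silently declared that "worse than placebo" is impossible.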
4. Composite Null Hypotheses in Oncology
Modern oncology trials often have composite H0:
"H0: The drug does not improve OS AND does not improve PFS" (two primary endpoints, either of which can support the efficacy claim).
The alpha is split: test PFS at α=0.025, test OS at α=0.025 (total α=0.05). Or use hierarchical testing: test PFS first at full α=0.05, and only if significant, test OS at α=0.05.
The H0 structure determines the multiplicity correction strategy, which determines whether the results are confirmatory or exploratory. Getting the null hypothesis wrong at the design stage can invalidate the entire trial.
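The hierarchical (fixed-sequence) strategy can be sketched as a gatekeeper: each H0 is tested at full α, but only if every H0 before it was rejected. Endpoint names and p-values here are placeholders:

```python
def hierarchical_test(p_values_in_order, alpha=0.05):
    """Fixed-sequence testing: test each hypothesis at full alpha,
    but stop at the first failure. H0s downstream of a failed gate
    are never formally tested, so no alpha is 'spent' on them."""
    results = {}
    for name, p in p_values_in_order:
        results[name] = p < alpha
        if not results[name]:
            break  # gate closed: later endpoints remain exploratory
    return results

hierarchical_test([("PFS", 0.01), ("OS", 0.03)])   # both H0s rejected
hierarchical_test([("PFS", 0.08), ("OS", 0.001)])  # OS never formally tested
```

The second call is the painful case: OS looks "significant" at p=0.001, but because the PFS gate failed, that finding carries no confirmatory weight.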
5. Futility Analysis — When H0 Wins Early
During interim analyses, a Data Safety Monitoring Board (DSMB) may stop a trial for futility — when the data are so consistent with H0 that continuing the trial would be pointless (no realistic chance of rejecting H0 by the end).
This is formalised as conditional power under H0: "If H0 is true, what is the probability of getting a significant result by the end of the trial?" If conditional power is <10%, the trial is futile.
Futility stopping saves patients from exposure to an ineffective drug and saves sponsors millions in trial costs. It's the null hypothesis doing useful work — not as a target to disprove, but as a realistic assessment that the drug genuinely might not work.
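In its simplest form (the B-value formulation, shown here as a teaching sketch rather than a DSMB tool), the calculation asks: given the interim z-score and the fraction of planned information already collected, how likely is final significance if the remaining data behave exactly as H0 predicts?

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def conditional_power_under_h0(z_interim, info_fraction, z_crit=1.96):
    """P(final Z exceeds z_crit | interim data, zero effect from now on).
    Uses the B-value decomposition Z_final = B + Z_remaining * sqrt(1 - t),
    ignoring the negligible chance of crossing in the opposite tail."""
    b = z_interim * math.sqrt(info_fraction)   # evidence already 'banked'
    return 1 - phi((z_crit - b) / math.sqrt(1 - info_fraction))

# Halfway through the trial (t = 0.5) with a weak interim signal (z = 0.5):
cp = conditional_power_under_h0(z_interim=0.5, info_fraction=0.5)
```

With these illustrative inputs the conditional power is roughly 1%, far below a typical 10% futility threshold: the trial would almost certainly end in "fail to reject H0" no matter how it continues.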
Branch-by-Branch — Where the Null Hypothesis Bites You
General Medicine
The scenario: You run a study comparing two antihypertensives. Mean BP reduction: Drug A = 14 mmHg, Drug B = 12 mmHg. p=0.35.
Your conclusion: "No significant difference. Both drugs are equally effective."
What you actually showed: You failed to reject H0 (Drug A = Drug B). But you did NOT prove they're equal. The 2 mmHg difference might be real — your study might simply have been too small to detect it.
The correct conclusion: "We found insufficient evidence to conclude that Drug A and Drug B differ in BP reduction. The observed 2 mmHg difference could be due to chance (p=0.35), but our study may have been underpowered to detect a difference of this magnitude."
The H0 trap: "Not significant" ≠ "equal." "Fail to reject H0" ≠ "H0 is true." This single misinterpretation fills journals with false claims of equivalence from underpowered studies.
Surgery
The scenario: A trial of a new surgical technique: "No significant difference in complication rates (p=0.22)."
The paper recommends: "Both techniques are equivalent. Use whichever is convenient."
The H0 trap: The study had n=40 per group. The power to detect a clinically meaningful difference (5% vs 15% complication rate) at α=0.05 was roughly 35%. The study was DESIGNED to fail: with 35% power, you had only a 35% chance of finding a real difference even if it existed.
"Failing to reject H0 when your power is 35% is like failing to find a burglar when you only searched one room of the house." You didn't prove the house is empty. You barely looked.
What an equivalence claim actually requires: A properly designed EQUIVALENCE or NON-INFERIORITY trial with pre-specified margins, adequate power (≥80%), and demonstration that the CI falls within the equivalence margin. That's MUCH harder than just "failing to find a difference" in an underpowered superiority trial.
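The ~35% power figure can be checked with a quick normal-approximation calculation (a simple unpooled sketch; exact software will differ by a point or two):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_proportions(p1, p2, n_per_group, z_alpha=1.96):
    """Approximate power of a two-sided two-sample comparison of
    proportions, using the unpooled normal approximation."""
    se = math.sqrt(p1 * (1 - p1) / n_per_group +
                   p2 * (1 - p2) / n_per_group)
    return phi(abs(p1 - p2) / se - z_alpha)

# 5% vs 15% complication rate, 40 patients per arm:
power = power_two_proportions(0.05, 0.15, 40)
```

The result is roughly one chance in three of detecting a genuine 10-percentage-point difference, which is why "p=0.22" here says almost nothing about the implants and a great deal about the sample size.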
Paediatrics
The scenario: A study of a new paediatric antibiotic: "Primary endpoint not met (p=0.08). However, significant improvement seen in the subgroup of children aged 2-5 (p=0.03)."
The H0 trap: The primary H0 was not rejected. The subgroup finding has no formal Type I error protection because the gate (primary endpoint significance) wasn't passed. In a hierarchical testing framework, p=0.03 in a subgroup has no confirmatory value if the primary H0 stands.
The paper reports the subgroup as a "positive finding." It's an exploratory observation dressed up as a conclusion. The null hypothesis framework says: the primary H0 stood. Everything downstream is hypothesis-generating, not hypothesis-confirming.
Obstetrics
The scenario: A trial of progesterone for preventing preterm birth: "No significant reduction in preterm birth rate (p=0.07)."
One camp says: "p=0.07 is close! The drug probably works." Other camp says: "Didn't reach significance. H0 stands."
What Fisher would say: p=0.07 is moderate evidence against H0. Not convincing, but suggestive. Worth pursuing in a larger trial.
What Neyman-Pearson would say: You pre-specified α=0.05. p=0.07 > 0.05. Decision: fail to reject H0. Done.
What the hybrid system does: Reports p=0.07 as "not significant" (Neyman-Pearson) but also notes it's "trending toward significance" or "borderline" (Fisher-ish). This is where the Frankenstein hybrid creates clinical confusion.
The practical lesson: p=0.07 means the data are somewhat unlikely under H0 but not unlikely ENOUGH to meet the pre-specified threshold. Whether to pursue a confirmatory trial depends on clinical context (severity of prematurity, unmet need, safety profile), not on whether 0.07 is "close to" 0.05.
Psychiatry
The scenario: An antidepressant trial. H0: Drug = Placebo on HAM-D change.
The placebo group improves by 8 points. The drug group improves by 10 points. Difference: 2 points. p=0.04.
H0 is rejected! But is the result meaningful?
The H0 trap: The null hypothesis only asks "is the difference ZERO?" It does not ask "is the difference CLINICALLY IMPORTANT?" A statistically significant 2-point difference on a 52-point scale (MCID ≈ 3 points) means: we've proven the drug does SOMETHING — but that something is clinically trivial.
Rejecting H0 proves that the drug is not identical to placebo. It does NOT prove the drug is useful. H0 is a LOW BAR. Beating it means you're better than nothing. Being better than nothing is necessary but not sufficient for clinical value.
This is why FDA's psychiatric drug reviews increasingly consider clinical significance (response rates, remission rates, effect sizes) alongside statistical significance (p-values, H0 rejection).
Community Medicine / PSM
The scenario: An ecological study: "Significant negative correlation between district literacy rate and infant mortality (r = -0.45, p=0.01)."
H0: r = 0 (no correlation). H0 rejected. p=0.01.
The H0 trap: Rejecting H0 means the correlation is not zero. It does NOT mean:
- Literacy CAUSES lower infant mortality (could be confounded by income, healthcare access, sanitation)
- Increasing literacy WILL reduce infant mortality (ecological correlation ≠ individual causation)
- The relationship is strong (r² = 0.20, so literacy explains only 20% of mortality variation)
The null hypothesis framework is a filter for noise. It is not a filter for confounding, causation, or clinical relevance. Passing through H0 (p < 0.05) gets you through ONE gate. There are many more gates (confounding, bias, effect size, external validity) that H0 doesn't guard.
Orthopaedics
The scenario: Implant A vs Implant B. "No significant difference in 10-year revision rates (p=0.12)."
Registry data, n=500 per implant. The study concludes: "Both implants have equivalent long-term performance."
The H0 trap: Revision rates were 4% (Implant A) vs 7% (Implant B). A 3% absolute difference. In a patient population of 100,000 TKRs per year, that's 3,000 extra revisions per year — each costing ₹3-5 lakh and involving significant morbidity.
The study "failed to reject H0" not because the implants are equivalent, but because n=500 per group provided insufficient power to detect a 3% absolute difference in revision rates (power ≈ 45%).
3,000 extra revisions per year were hidden behind a "non-significant p-value" because the study was too small. The null hypothesis wasn't proven true — it simply wasn't disproven. And the clinical consequence of that distinction is enormous.
The 6 Ways Not Knowing the Null Hypothesis Destroys You
1. You treat "not significant" as "no effect"
This is the cardinal sin. "Fail to reject H0" means the EVIDENCE was insufficient. It does NOT mean H0 is true. An underpowered study will "fail to reject H0" even when the drug genuinely works. The study failed, not the drug.
The fix: Always ask: "What was the power?" If power was <80%, a non-significant result tells you nothing — the study wasn't designed to detect the effect.
2. You treat "significant" as "important"
Rejecting H0 means the effect is not zero. It doesn't mean the effect is large, clinically meaningful, or worth acting on. A p=0.001 for a 0.5 mmHg blood pressure difference is statistically significant and clinically useless.
The fix: Always look at the effect size and confidence interval, not just the p-value. H0 rejection is necessary but not sufficient for clinical relevance.
3. You can't formulate your thesis hypothesis correctly
Your examiner asks: "State your null and alternative hypotheses."
Wrong: "H0: The new drug is better than the old drug." Wrong: "H0: There is a significant difference."
Right: "H0: There is no difference in FEV1 improvement between Drug A and Drug B at 12 weeks. H1: There is a difference in FEV1 improvement between Drug A and Drug B at 12 weeks."
H0 is always the boring one. The nothing-happening one. The default. Students chronically confuse H0 and H1, or state H0 as what they hope to find. If your H0 is what you WANT to be true, you've flipped the framework upside down.
4. You don't understand why two trials are needed for drug approval
The two-trial requirement is Type I error multiplication: α × α = 0.05 × 0.05 = 0.0025. Both trials independently reject H0 → combined false positive rate = 0.25%.
If you don't understand H0, you don't understand why this multiplication works, why single-trial approvals carry higher risk, or why FDA sometimes accepts one trial (when the effect is so large that the single-trial p-value is itself vanishingly small).
5. You can't distinguish exploratory from confirmatory research
Confirmatory: H0 and H1 stated before data collection. Analysis plan pre-specified. α controlled. Exploratory: No pre-specified H0. Data-driven. Multiple comparisons without correction.
A confirmatory rejection of H0 is evidence. An exploratory "significant finding" is a hypothesis for the next study. Confusing the two fills the medical literature with unreproducible results.
6. You can't participate in the biggest debate in modern statistics
In 2019, The American Statistician published the editorial "Moving to a World Beyond 'p < 0.05'", and a Nature comment signed by more than 800 scientists called for "statistical significance" to be retired. The debate over whether to abolish the null hypothesis significance testing (NHST) framework is among the most consequential methodological debates in modern medicine.
Alternatives proposed: Bayesian methods, estimation (CIs without testing), effect sizes with uncertainty intervals, decision-theoretic frameworks.
If you don't understand H0, you can't understand why people want to reform it, what the alternatives are, or how this debate affects the future of evidence-based medicine. You're watching a revolution from the outside.
The One Thing to Remember
The null hypothesis is medicine's safety valve. It's the system saying: "Prove it."
You don't get to claim your drug works, your technique is better, or your risk factor causes disease just because your sample shows a difference. You have to show that the difference is too large to be explained by chance — by starting from the assumption that chance explains everything.
It's counterintuitive. It's frustrating. It occasionally blocks effective treatments (Type II errors). But it protects millions of patients from ineffective or harmful interventions that LOOKED promising in small, noisy samples.
The null hypothesis is not your enemy. It's not a bureaucratic hurdle. It's the mathematical embodiment of a moral principle: before we expose patients to a new treatment, we demand evidence that it actually works. That demand starts with H0.
H0 says: "Nothing is happening." Your data says: "Something might be." The p-value asks: "How surprised would we be to see this data if nothing were really happening?" And if the surprise is large enough — if the data are too extreme to be plausibly explained by H0 — the null falls, and evidence wins.
That's the logic. It's 90 years old. Fisher started it. Neyman and Pearson formalised it. The hybrid we use today is philosophically messy but practically effective. And every drug you prescribe, every guideline you follow, and every paper you read was built on this foundation of deliberate, systematic doubt.