Why Do You Start Every Experiment by Assuming Your Drug Doesn't Work?
The Problem First
You're a pulmonology resident. You've spent two years collecting data for your thesis: a new inhaled corticosteroid vs the standard one for moderate asthma. You measured FEV1 improvement at 12 weeks. Your drug showed 320 mL improvement. The standard showed 280 mL. A 40 mL difference.
You're excited. Your drug is better.
Your biostatistics professor says: "Before you celebrate, you need to prove that this 40 mL difference isn't just random noise."
You say: "But I can SEE it's different. 320 is more than 280."
She says: "I can SEE shapes in clouds. That doesn't make them real. The statistical framework starts from the assumption that your drug does NOTHING different. You have to provide enough evidence to overturn that assumption."
You have to prove your drug works by first assuming it doesn't. That assumption — the assumption of nothing happening — is called the null hypothesis. And the entire logic of statistical testing, drug approval, and evidence-based medicine is built on this strange, counterintuitive starting point.
Why would anyone design a system that starts by assuming the thing you're testing doesn't work?
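To see what the professor is demanding, here is a minimal sketch of the kind of test she has in mind: a two-sample comparison of the 320 mL and 280 mL means. The standard deviation (150 mL) and group size (60 per arm) are invented for illustration; the scenario above doesn't specify them.

```python
import math

def two_sample_z_test(mean1, mean2, sd, n_per_group):
    """Normal-approximation two-sample test of H0: mean1 == mean2.
    Returns the z statistic and the two-sided p-value."""
    se = sd * math.sqrt(2.0 / n_per_group)          # SE of the difference in means
    z = (mean1 - mean2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))
    return z, p_two_sided

# Hypothetical numbers: SD = 150 mL, n = 60 per arm (NOT from the scenario)
z, p = two_sample_z_test(320, 280, sd=150, n_per_group=60)
```

With these made-up numbers the p-value comes out around 0.14: a 40 mL difference would appear by chance well over one time in ten even if the two drugs were identical. "I can SEE it's different" is not evidence.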
Word Surgery: "Null Hypothesis"
"Null"
Root: Latin nullus = "not any" / "none" / "zero" → From ne- (not) + ullus (any)
Literal meaning: "nothing" / "zero" / "no effect"
In statistics: The assumption that there is NO effect, NO difference, NO association. The treatment does nothing. The groups are the same. The correlation is zero. The drug is a fancy placebo.
Why "null" and not "zero" or "nothing"? Because "null" has a specific connotation of deliberate negation — it's not that we don't know whether there's an effect. We're actively ASSUMING there isn't one. "Null" implies a formal declaration of nothingness, not just ignorance.
"Hypothesis"
Root: Greek hypo- (under, below) + thesis (a placing, a proposition) → hypothesis = "something placed underneath" / "a foundation placed below"
Literal meaning: "the proposition you build your test on top of"
Why "hypo-" (under)? Because a hypothesis is the FOUNDATION underneath your experiment. You set it down first, then build the test on top of it. You place it UNDER the evidence and see if the evidence crushes it or it holds up.
→ So "null hypothesis" literally = "the foundation of nothingness that you place underneath your experiment."
→ Aha: You build a floor of "nothing is happening." Then you pile evidence on top. If the evidence is heavy enough, the floor cracks. If the floor holds, you haven't proven anything is happening.
Naming Family
| Term | Symbol | What It Means | The Name Logic |
|---|---|---|---|
| Null Hypothesis | H0 | Nothing is happening | "Null" = zero effect. The default. |
| Alternative Hypothesis | H1 or Ha | Something IS happening | "Alternative" = the OTHER possibility, the one you actually believe |
| Research Hypothesis | — | What the researcher hopes to show | Often identical to H1, but conceptual not formal |
| One-sided (one-tailed) | H1: μ₁ > μ₂ | Effect is in a specific direction | "One side" of the distribution |
| Two-sided (two-tailed) | H1: μ₁ ≠ μ₂ | Effect could be in either direction | "Both sides" of the distribution |
The confusing part: The null hypothesis is usually what you DON'T believe. The alternative hypothesis is usually what you DO believe. You test the thing you don't believe in order to prove the thing you do believe. It's like proving you're not a criminal instead of proving you're a good person. The system is built on DISPROVING THE NEGATIVE, not on proving the positive.
Who Invented This? — The Neyman-Pearson-Fisher War
The null hypothesis has one of the most contentious origin stories in science. Three giants fought over it, and we ended up with a Frankenstein hybrid that none of them would fully endorse.
Ronald Fisher (1890-1962) — The First Formulation
Fisher introduced the concept of the null hypothesis in the 1920s and 1930s; the term itself first appears in his 1935 book The Design of Experiments. His framework was:
- State a null hypothesis (H0): "There is no effect"
- Calculate a test statistic from your data
- Find the p-value: the probability of getting data this extreme (or more extreme) IF H0 is true
- If the p-value is small enough, reject H0
Fisher's key idea: The null hypothesis is a straw man — you set it up to knock it down. You never "accept" H0. You either reject it or "fail to reject" it. The evidence either disproves H0 or is insufficient to disprove it.
Fisher did NOT use a fixed α threshold. He reported exact p-values and let the researcher judge: p=0.001 is stronger evidence against H0 than p=0.04. He considered α=0.05 "convenient" but not sacred.
Fisher did NOT use an alternative hypothesis. In his framework, you only specify H0. If you reject it, you conclude "something is happening" without formally specifying what that "something" is.
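Fisher's logic is easiest to see in the randomisation test he pioneered: make H0 literally true by shuffling the group labels, then count how often chance alone reproduces a difference as extreme as the observed one. A pure-Python sketch (the function name and data are illustrative):

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=42):
    """How often does randomly relabelling the observations (H0 made
    literal: the labels carry no information) produce a mean difference
    at least as extreme as the one actually observed?"""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return hits / n_perm
```

A small p-value means the shuffled (null) world rarely produces anything as extreme as the real data, which is exactly Fisher's notion of evidence against H0.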
Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980) — The Rival Framework
Neyman and Pearson (Egon was the son of Karl Pearson, who coined the term "standard deviation") developed a competing framework in the 1930s:
- State TWO hypotheses: H0 (null) AND H1 (alternative)
- Choose α (Type I error rate) BEFORE collecting data
- Choose β (Type II error rate) and calculate required sample size
- Collect data, compute test statistic
- If test statistic exceeds the critical value → reject H0 in favour of H1
- If not → fail to reject H0 (NOT "accept H0")
Neyman-Pearson's key difference: They introduced the ALTERNATIVE hypothesis and the concept of TYPE I and TYPE II errors. Fisher only had H0. Neyman-Pearson had both H0 and H1, and explicitly balanced the risk of wrongly rejecting H0 (Type I) against the risk of wrongly failing to reject it (Type II).
Neyman-Pearson also introduced POWER (1 - β) — the probability of correctly rejecting H0 when H1 is true. Fisher never used the concept of power.
The War
Fisher HATED Neyman-Pearson's framework. He called it "childish" and "horrifying." His objections:
- Fixed α is stupid. Fisher believed p-values should be interpreted in context, not compared against a rigid threshold. A p=0.049 should not be treated fundamentally differently from p=0.051.
- The alternative hypothesis is unnecessary. Fisher believed you should only specify what you're trying to disprove (H0), not what you're trying to prove (H1). Specifying H1 biases the experiment.
- Decision-making is not the goal of science. Fisher saw his framework as measuring EVIDENCE AGAINST H0. Neyman-Pearson saw theirs as a DECISION PROCEDURE (accept or reject). Fisher thought reducing science to binary decisions was reductive.
What We Actually Use — The Frankenstein Hybrid
Modern medical statistics is a bastardised hybrid of both frameworks that neither Fisher nor Neyman-Pearson would recognise:
| Feature | Fisher | Neyman-Pearson | What We Actually Do |
|---|---|---|---|
| Hypotheses | H0 only | H0 and H1 | H0 and H1 |
| α | Not fixed, p-value is continuous | Fixed before data collection | "Fixed" at 0.05 but also report exact p-values |
| p-value | Continuous measure of evidence | Compared to α, binary decision | Both — report p-value AND compare to α=0.05 |
| Power | Not used | Central to design | Used for sample size but often ignored in interpretation |
| Language | "The evidence against H0 is strong/weak" | "Reject H0 / Fail to reject H0" | "Statistically significant" (a phrase neither invented) |
Nobody intended the system we have. Fisher would be appalled that we use fixed α=0.05 as a universal threshold. Neyman would be appalled that we interpret p-values as evidence strength. Pearson would be appalled at the entire mess.
But it works well enough for regulatory decision-making. The hybrid gives us a framework for controlling error rates (Neyman-Pearson's contribution) while also quantifying evidence strength (Fisher's contribution). The philosophical inconsistency bothers statisticians. The practical utility serves medicine.
Why Is "Null" Confusing? — The Dictionary Collision
| Source | "Null" means... |
|---|---|
| Legal | "Null and void" = invalid, having no legal force |
| Computing | Null = empty, no value assigned, undefined |
| Everyday | "Null result" = a failed experiment |
| Statistics | Null = the assumption of ZERO effect (not "failed" or "invalid") |
The Three Confusions
Confusion 1: "Null" ≠ "failed"
When a study "fails to reject the null hypothesis," students think the study "failed." It didn't. It produced a valid result — the result is that there wasn't enough evidence against H0. That's information, not failure.
"We found no significant difference" is a result, not a failure. But the word "null" makes it FEEL like failure because "null" in everyday language means "nothing, void, worthless."
Confusion 2: "Fail to reject H0" ≠ "Accept H0"
The double negative is agonising. "Fail to reject" is not the same as "accept." Not being able to disprove innocence is not the same as proving innocence.
Why don't we just say "accept H0"? Because failure to find evidence against H0 doesn't mean H0 is true. It might mean:
- The effect is real but your sample was too small (underpowered)
- The effect is real but your measurement was too imprecise
- The effect is real but in a different direction than you tested (one-tailed vs two-tailed)
"Fail to reject" preserves the ambiguity. "Accept" falsely implies certainty.
Confusion 3: H0 is what you DON'T believe
In every other domain, you state what you believe and try to prove it. In court, the prosecution states their case and tries to prove it.
In statistics, you state what you DON'T believe (H0) and try to disprove it. It's like a prosecutor saying: "Let me assume the defendant is innocent, and then show you the evidence is so overwhelming that this assumption is untenable."
This is proof by contradiction — the same logic used in mathematical proofs. Assume the opposite of what you want to show. Demonstrate that the assumption leads to an absurd conclusion (p < 0.05 = "if H0 were true, getting data this extreme would be absurdly unlikely"). Therefore, the assumption must be wrong. Therefore, H0 is probably false.
It's rigorous. It's powerful. It's deeply unintuitive for anyone who hasn't done mathematical logic.
Why Start with "Nothing Is Happening"? — The Deep Reason
The Courtroom Analogy
The null hypothesis mirrors the legal principle of presumption of innocence:
| Legal System | Statistical System |
|---|---|
| Defendant is presumed innocent | Drug is presumed ineffective (H0) |
| Prosecution must prove guilt | Researcher must prove efficacy |
| Beyond reasonable doubt | p < 0.05 (or the chosen α) |
| Guilty verdict | Reject H0 |
| Not guilty verdict | Fail to reject H0 |
| Not guilty ≠ innocent | Fail to reject ≠ H0 is true |
Why presume innocence? Because the consequences of wrongly convicting an innocent person (Type I error) are considered worse than wrongly acquitting a guilty person (Type II error). The system is deliberately conservative — it would rather let 10 guilty people go free than convict 1 innocent person.
Why presume no drug effect? Because the consequences of approving an ineffective (or harmful) drug (Type I error) are considered worse than failing to approve an effective drug (Type II error). The system would rather reject 10 effective drugs than approve 1 harmful one.
This asymmetry is a moral choice, not a mathematical one. Society decided that protecting the public from harm is more important than maximising access to treatments. H0 embodies that choice.
The Parsimony Principle
There's a deeper philosophical reason: Occam's Razor.
"Do not multiply entities beyond necessity." The simplest explanation should be preferred until evidence demands a more complex one.
H0 is the simplest explanation: nothing is happening. The groups are the same. The treatment has no effect. The correlation is zero. Any observed difference is just random noise.
H1 is the more complex explanation: something IS happening. A real effect exists.
Starting with H0 means requiring evidence before accepting complexity. You don't get to claim your drug works just because your sample showed a difference. You have to show that the difference is too large to be plausibly explained by chance alone (H0).
The Regulatory Dimension
FDA and the Null Hypothesis — Structural Conservatism
The entire FDA approval process is built on the null hypothesis framework.
The sponsor (pharmaceutical company) must provide "substantial evidence of effectiveness" (Federal Food, Drug, and Cosmetic Act, Section 505). This means: the sponsor must disprove H0, not merely suggest H1.
1. Pre-specification of H0 and H1
ICH E9 requires that the null and alternative hypotheses be pre-specified in the Statistical Analysis Plan (SAP) before the study is unblinded.
Why? Because if you specify H0 AFTER seeing the data, you can tailor it to get a significant result (p-hacking). Pre-specification ensures that the hypothesis was not influenced by the data.
An illustrative example of this going wrong: A trial tests 5 different dose groups against placebo. After unblinding, only the 200 mg dose "works." The sponsor rewrites the SAP to specify H0 as "200 mg = placebo" and claims a pre-specified positive result. FDA catches this because the original SAP specified H0 as "any dose = placebo" with multiplicity correction.
2. The Two-Trial Rule
FDA traditionally requires TWO adequate and well-controlled pivotal trials, each rejecting H0 at α=0.05 (two-sided).
The mathematical logic: if each trial independently rejects H0 at α=0.05, the probability of BOTH being false positives (both wrongly rejecting H0) = 0.05 × 0.05 = 0.0025 (1 in 400).
One trial at α=0.05 → 1-in-20 chance of false positive. Two trials → 1-in-400 chance. That's the statistical logic behind the two-trial requirement.
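The arithmetic is trivial, but worth writing down:

```python
alpha = 0.05                      # per-trial Type I error rate
both_wrong = alpha * alpha        # both pivotal trials falsely reject H0
print(round(both_wrong, 4))       # 0.0025, i.e. a 1-in-400 false positive rate
```

This relies on the two trials being independent; shared biases (same flawed endpoint, same enriched population) can make the real joint error rate higher than 1 in 400.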
Exceptions exist: For rare diseases, serious conditions with unmet need, or when a single trial is extraordinarily persuasive (massive effect size, very low p-value, consistent secondary endpoints), FDA may accept one pivotal trial. But the DEFAULT is two — because the null hypothesis framework demands replication to protect against false positives.
3. One-Sided vs Two-Sided H1
Word Surgery: "One-Sided" and "Two-Sided"
Why these names? Because they describe which SIDE(S) of the distribution you're examining.
A two-sided test asks: "Is the drug different from placebo?" (could be better OR worse) → H0: Drug = Placebo. H1: Drug ≠ Placebo. → You reject H0 if the difference falls in EITHER tail (either side).
A one-sided test asks: "Is the drug better than placebo?" (only better, not worse) → H0: Drug ≤ Placebo. H1: Drug > Placebo. → You reject H0 only if the difference falls in ONE tail.
FDA's position: Almost always requires two-sided tests at α=0.05 (equivalent to one-sided at α=0.025). Why? Because a one-sided test ASSUMES the drug can't be harmful — which is medically dangerous. A drug might make things worse, and a one-sided test would never detect it.
The exception: Non-inferiority trials use one-sided testing by convention (you're only asking "is the new drug no worse than the standard?").
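The relationship between the two is pure bookkeeping: for an effect in the hypothesised direction, the one-sided p-value is half the two-sided one, which is exactly why a two-sided test at α=0.05 matches a one-sided test at α=0.025. A sketch using the normal distribution:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def p_two_sided(z):
    """Reject H0 for extreme values in EITHER tail."""
    return 2 * (1 - phi(abs(z)))

def p_one_sided_upper(z):
    """H1: drug > placebo, so only the upper tail counts."""
    return 1 - phi(z)

z = 1.96  # the familiar two-sided 5% critical value
```

At z = 1.96 the two-sided p-value is about 0.05 and the one-sided p-value about 0.025: same data, same statistic, but the one-sided test has silently declared that "worse than placebo" is impossible.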
4. Composite Null Hypotheses in Oncology
Modern oncology trials often have composite H0:
"H0: The drug does not improve OS AND does not improve PFS" (two primary endpoints, either of which can support the efficacy claim).
The alpha is split: test PFS at α=0.025, test OS at α=0.025 (total α=0.05). Or use hierarchical testing: test PFS first at full α=0.05, and only if significant, test OS at α=0.05.
The H0 structure determines the multiplicity correction strategy, which determines whether the results are confirmatory or exploratory. Getting the null hypothesis wrong at the design stage can invalidate the entire trial.
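The hierarchical (fixed-sequence) strategy can be sketched as a gatekeeper: each H0 is tested at full α, but only if every H0 before it was rejected. Endpoint names and p-values here are placeholders:

```python
def hierarchical_test(p_values_in_order, alpha=0.05):
    """Fixed-sequence testing: test each hypothesis at full alpha,
    but stop at the first failure. H0s downstream of a failed gate
    are never formally tested, so no alpha is 'spent' on them."""
    results = {}
    for name, p in p_values_in_order:
        results[name] = p < alpha
        if not results[name]:
            break  # gate closed: later endpoints remain exploratory
    return results

hierarchical_test([("PFS", 0.01), ("OS", 0.03)])   # both H0s rejected
hierarchical_test([("PFS", 0.08), ("OS", 0.001)])  # OS never formally tested
```

The second call is the painful case: OS looks "significant" at p=0.001, but because the PFS gate failed, that finding carries no confirmatory weight.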
5. Futility Analysis — When H0 Wins Early
During interim analyses, a Data Safety Monitoring Board (DSMB) may stop a trial for futility — when the data are so consistent with H0 that continuing the trial would be pointless (no realistic chance of rejecting H0 by the end).
This is formalised as conditional power under H0: "If H0 is true, what is the probability of getting a significant result by the end of the trial?" If conditional power is <10%, the trial is futile.
Futility stopping saves patients from exposure to an ineffective drug and saves sponsors millions in trial costs. It's the null hypothesis doing useful work — not as a target to disprove, but as a realistic assessment that the drug genuinely might not work.
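In its simplest form (the B-value formulation, shown here as a teaching sketch rather than a DSMB tool), the calculation asks: given the interim z-score and the fraction of planned information already collected, how likely is final significance if the remaining data behave exactly as H0 predicts?

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def conditional_power_under_h0(z_interim, info_fraction, z_crit=1.96):
    """P(final Z exceeds z_crit | interim data, zero effect from now on).
    Uses the B-value decomposition Z_final = B + Z_remaining * sqrt(1 - t),
    ignoring the negligible chance of crossing in the opposite tail."""
    b = z_interim * math.sqrt(info_fraction)   # evidence already 'banked'
    return 1 - phi((z_crit - b) / math.sqrt(1 - info_fraction))

# Halfway through the trial (t = 0.5) with a weak interim signal (z = 0.5):
cp = conditional_power_under_h0(z_interim=0.5, info_fraction=0.5)
```

With these illustrative inputs the conditional power is roughly 1%, far below a typical 10% futility threshold: the trial would almost certainly end in "fail to reject H0" no matter how it continues.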
Branch-by-Branch — Where the Null Hypothesis Bites You
General Medicine
The scenario: You run a study comparing two antihypertensives. Mean BP reduction: Drug A = 14 mmHg, Drug B = 12 mmHg. p=0.35.
Your conclusion: "No significant difference. Both drugs are equally effective."
What you actually showed: You failed to reject H0 (Drug A = Drug B). But you did NOT prove they're equal. The 2 mmHg difference might be real — your study might simply have been too small to detect it.
The correct conclusion: "We found insufficient evidence to conclude that Drug A and Drug B differ in BP reduction. The observed 2 mmHg difference could be due to chance (p=0.35), but our study may have been underpowered to detect a difference of this magnitude."
The H0 trap: "Not significant" ≠ "equal." "Fail to reject H0" ≠ "H0 is true." This single misinterpretation fills journals with false claims of equivalence from underpowered studies.
Surgery
The scenario: A trial of a new surgical technique: "No significant difference in complication rates (p=0.22)."
The paper recommends: "Both techniques are equivalent. Use whichever is convenient."
The H0 trap: The study had n=40 per group. The power to detect a clinically meaningful difference (5% vs 15% complication rate) at α=0.05 was roughly 35%. The study was DESIGNED to fail: with 35% power, you had only a 35% chance of finding a real difference even if it existed.
"Failing to reject H0 when your power is 35% is like failing to find a burglar when you only searched one room of the house." You didn't prove the house is empty. You barely looked.
What an equivalence claim actually requires: A properly designed EQUIVALENCE or NON-INFERIORITY trial with pre-specified margins, adequate power (≥80%), and demonstration that the CI falls within the equivalence margin. That's MUCH harder than just "failing to find a difference" in an underpowered superiority trial.
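The ~35% power figure can be checked with a quick normal-approximation calculation (a simple unpooled sketch; exact software will differ by a point or two):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_proportions(p1, p2, n_per_group, z_alpha=1.96):
    """Approximate power of a two-sided two-sample comparison of
    proportions, using the unpooled normal approximation."""
    se = math.sqrt(p1 * (1 - p1) / n_per_group +
                   p2 * (1 - p2) / n_per_group)
    return phi(abs(p1 - p2) / se - z_alpha)

# 5% vs 15% complication rate, 40 patients per arm:
power = power_two_proportions(0.05, 0.15, 40)
```

The result is roughly one chance in three of detecting a genuine 10-percentage-point difference, which is why "p=0.22" here says almost nothing about the implants and a great deal about the sample size.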
Paediatrics
The scenario: A study of a new paediatric antibiotic: "Primary endpoint not met (p=0.08). However, significant improvement seen in the subgroup of children aged 2-5 (p=0.03)."
The H0 trap: The primary H0 was not rejected. The subgroup finding has no formal Type I error protection because the gate (primary endpoint significance) wasn't passed. In a hierarchical testing framework, p=0.03 in a subgroup has no confirmatory value if the primary H0 stands.
The paper reports the subgroup as a "positive finding." It's an exploratory observation dressed up as a conclusion. The null hypothesis framework says: the primary H0 stood. Everything downstream is hypothesis-generating, not hypothesis-confirming.
Obstetrics
The scenario: A trial of progesterone for preventing preterm birth: "No significant reduction in preterm birth rate (p=0.07)."
One camp says: "p=0.07 is close! The drug probably works." Other camp says: "Didn't reach significance. H0 stands."
What Fisher would say: p=0.07 is moderate evidence against H0. Not convincing, but suggestive. Worth pursuing in a larger trial.
What Neyman-Pearson would say: You pre-specified α=0.05. p=0.07 > 0.05. Decision: fail to reject H0. Done.
What the hybrid system does: Reports p=0.07 as "not significant" (Neyman-Pearson) but also notes it's "trending toward significance" or "borderline" (Fisher-ish). This is where the Frankenstein hybrid creates clinical confusion.
The practical lesson: p=0.07 means the data are somewhat unlikely under H0 but not unlikely ENOUGH to meet the pre-specified threshold. Whether to pursue a confirmatory trial depends on clinical context (severity of prematurity, unmet need, safety profile), not on whether 0.07 is "close to" 0.05.
Psychiatry
The scenario: An antidepressant trial. H0: Drug = Placebo on HAM-D change.
The placebo group improves by 8 points. The drug group improves by 10 points. Difference: 2 points. p=0.04.
H0 is rejected! But is the result meaningful?
The H0 trap: The null hypothesis only asks "is the difference ZERO?" It does not ask "is the difference CLINICALLY IMPORTANT?" A statistically significant 2-point difference on a 52-point scale (MCID ≈ 3 points) means: we've proven the drug does SOMETHING — but that something is clinically trivial.
Rejecting H0 proves that the drug is not identical to placebo. It does NOT prove the drug is useful. H0 is a LOW BAR. Beating it means you're better than nothing. Being better than nothing is necessary but not sufficient for clinical value.
This is why FDA's psychiatric drug reviews increasingly consider clinical significance (response rates, remission rates, effect sizes) alongside statistical significance (p-values, H0 rejection).
Community Medicine / PSM
The scenario: An ecological study: "Significant negative correlation between district literacy rate and infant mortality (r = -0.45, p=0.01)."
H0: r = 0 (no correlation). H0 rejected. p=0.01.
The H0 trap: Rejecting H0 means the correlation is not zero. It does NOT mean:
- Literacy CAUSES lower infant mortality (could be confounded by income, healthcare access, sanitation)
- Increasing literacy WILL reduce infant mortality (ecological correlation ≠ individual causation)
- The relationship is strong (r² = 0.20, so literacy explains only 20% of mortality variation)
The null hypothesis framework is a filter for noise. It is not a filter for confounding, causation, or clinical relevance. Passing through H0 (p < 0.05) gets you through ONE gate. There are many more gates (confounding, bias, effect size, external validity) that H0 doesn't guard.
Orthopaedics
The scenario: Implant A vs Implant B. "No significant difference in 10-year revision rates (p=0.12)."
Registry data, n=500 per implant. The study concludes: "Both implants have equivalent long-term performance."
The H0 trap: Revision rates were 4% (Implant A) vs 7% (Implant B). A 3% absolute difference. In a patient population of 100,000 TKRs per year, that's 3,000 extra revisions per year — each costing ₹3-5 lakh and involving significant morbidity.
The study "failed to reject H0" not because the implants are equivalent, but because n=500 per group provided insufficient power to detect a 3% absolute difference in revision rates (power ≈ 45%).
3,000 extra revisions per year were hidden behind a "non-significant p-value" because the study was too small. The null hypothesis wasn't proven true — it simply wasn't disproven. And the clinical consequence of that distinction is enormous.
The 6 Ways Not Knowing the Null Hypothesis Destroys You
1. You treat "not significant" as "no effect"
This is the cardinal sin. "Fail to reject H0" means the EVIDENCE was insufficient. It does NOT mean H0 is true. An underpowered study will "fail to reject H0" even when the drug genuinely works. The study failed, not the drug.
The fix: Always ask: "What was the power?" If power was <80%, a non-significant result tells you nothing — the study wasn't designed to detect the effect.
2. You treat "significant" as "important"
Rejecting H0 means the effect is not zero. It doesn't mean the effect is large, clinically meaningful, or worth acting on. A p=0.001 for a 0.5 mmHg blood pressure difference is statistically significant and clinically useless.
The fix: Always look at the effect size and confidence interval, not just the p-value. H0 rejection is necessary but not sufficient for clinical relevance.
3. You can't formulate your thesis hypothesis correctly
Your examiner asks: "State your null and alternative hypotheses."
Wrong: "H0: The new drug is better than the old drug." Wrong: "H0: There is a significant difference."
Right: "H0: There is no difference in FEV1 improvement between Drug A and Drug B at 12 weeks. H1: There is a difference in FEV1 improvement between Drug A and Drug B at 12 weeks."
H0 is always the boring one. The nothing-happening one. The default. Students chronically confuse H0 and H1, or state H0 as what they hope to find. If your H0 is what you WANT to be true, you've flipped the framework upside down.
4. You don't understand why two trials are needed for drug approval
The two-trial requirement is Type I error multiplication: α × α = 0.05 × 0.05 = 0.0025. Both trials independently reject H0 → combined false positive rate = 0.25%.
If you don't understand H0, you don't understand why this multiplication works, why single-trial approvals carry higher risk, or why FDA sometimes accepts one trial (when the effect is so large that the single-trial p-value is itself vanishingly small).
5. You can't distinguish exploratory from confirmatory research
Confirmatory: H0 and H1 stated before data collection. Analysis plan pre-specified. α controlled. Exploratory: No pre-specified H0. Data-driven. Multiple comparisons without correction.
A confirmatory rejection of H0 is evidence. An exploratory "significant finding" is a hypothesis for the next study. Confusing the two fills the medical literature with unreproducible results.
6. You can't participate in the biggest debate in modern statistics
In 2019, The American Statistician published the editorial "Moving to a World Beyond 'p < 0.05'", and a Nature comment signed by more than 800 scientists called for "statistical significance" to be retired. The debate over whether to abolish the null hypothesis significance testing (NHST) framework is among the most consequential methodological debates in modern medicine.
Alternatives proposed: Bayesian methods, estimation (CIs without testing), effect sizes with uncertainty intervals, decision-theoretic frameworks.
If you don't understand H0, you can't understand why people want to reform it, what the alternatives are, or how this debate affects the future of evidence-based medicine. You're watching a revolution from the outside.
The One Thing to Remember
The null hypothesis is medicine's safety valve. It's the system saying: "Prove it."
You don't get to claim your drug works, your technique is better, or your risk factor causes disease just because your sample shows a difference. You have to show that the difference is too large to be explained by chance — by starting from the assumption that chance explains everything.
It's counterintuitive. It's frustrating. It occasionally blocks effective treatments (Type II errors). But it protects millions of patients from ineffective or harmful interventions that LOOKED promising in small, noisy samples.
The null hypothesis is not your enemy. It's not a bureaucratic hurdle. It's the mathematical embodiment of a moral principle: before we expose patients to a new treatment, we demand evidence that it actually works. That demand starts with H0.
H0 says: "Nothing is happening." Your data says: "Something might be." The p-value asks: "How surprised would we be to see this data if nothing were really happening?" And if the surprise is large enough — if the data are too extreme to be plausibly explained by H0 — the null falls, and evidence wins.
That's the logic. It's 90 years old. Fisher started it. Neyman and Pearson formalised it. The hybrid we use today is philosophically messy but practically effective. And every drug you prescribe, every guideline you follow, and every paper you read was built on this foundation of deliberate, systematic doubt.