Who Picked 0.05 — And Why Does That One Number Control All of Medicine?
The Problem First
You're an ENT resident presenting your thesis at a conference. You compared two treatments for chronic sinusitis. Your primary result: p = 0.052.
The moderator says: "Not significant."
Your friend's study, same topic, same design, slightly luckier sample: p = 0.048.
The moderator says: "Significant."
The difference between your careers at this moment — who gets published, who gets the award, whose treatment gets adopted — rests on 0.004. Four thousandths. A rounding error in biology.
You ask the question every honest scientist eventually asks:
"Who decided that 0.05 is the magic number? Why not 0.06? Why not 0.04? Who gave one number this much power over my career, my patients, and my science?"
The answer is both fascinating and infuriating.
Word Surgery: "Level of Significance"
"Level"
Root: Latin libella = "a small balance" / "a spirit level" (the carpenter's tool) → Old French livel → English "level"
Literal meaning: "a horizontal reference line" → "a threshold, a standard, a bar to clear"
In statistics: The bar you set. The threshold. The line the evidence must cross.
→ So "level" = the height of the bar your evidence must jump over.
"Significance"
Root: Latin significare = signum (sign, mark) + facere (to make) → "to make a sign" / "to indicate" / "to point at something"
Literal meaning: "making a sign that something is there"
NOT what you think. In everyday English, "significant" means "important, meaningful, substantial." In statistics, "significant" means "unlikely to be due to chance alone." A drug that lowers BP by 0.1 mmHg can be "statistically significant" with a large enough sample. It is utterly "insignificant" clinically.
→ "Level of significance" literally = "the threshold for declaring that a sign has been made" — the bar your data must clear before you're allowed to say 'something is happening.'"
→ Aha: Think of it as a court's standard of evidence. The "level" is how much evidence you demand before delivering a verdict. Set it too low → you convict too many innocents (Type I errors). Set it too high → you let too many guilty people walk (Type II errors).
The Symbol: α (Alpha)
Root: Alpha (Α, α) is the first letter of the Greek alphabet, from Phoenician ʾālep (ox).
Why α? Neyman and Pearson (1930s) needed symbols for their two error rates. They chose the first two letters of the Greek alphabet:
- α (alpha) = probability of Type I error (false positive) = the level of significance
- β (beta) = probability of Type II error (false negative)
No deep meaning. Just first letter = first error type, second letter = second error type. Alphabetical order matching conceptual priority: controlling α (protecting against false approvals) was considered more important than controlling β (maximising detection).
Naming Family
| Term | Symbol | What It Is | How It Relates |
|---|---|---|---|
| Level of significance | α | The threshold you set BEFORE the study | The bar |
| p-value | p | The probability calculated FROM the data | What you compare to the bar |
| Significant | p < α | The data cleared the bar | The verdict |
| Not significant | p ≥ α | The data didn't clear the bar | The other verdict |
| Critical value | z, t | The test statistic threshold corresponding to α | Same bar, different unit (test statistic instead of probability) |
| Rejection region | — | The zone beyond the critical value | The area past the bar |
The relationship: α is CHOSEN. p is CALCULATED. If p < α → "significant." That's the entire decision rule. Everything else is commentary.
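A minimal sketch of that decision rule in Python. The data here are simulated and the two-sample t-test is just one illustrative choice of test; the point is that α exists before the data and p comes from the data:

```python
import numpy as np
from scipy import stats

alpha = 0.05                      # chosen BEFORE seeing any data
rng = np.random.default_rng(1)

# Hypothetical trial: treatment vs control, simulated outcomes
control = rng.normal(loc=0.0, scale=1.0, size=60)
treatment = rng.normal(loc=0.4, scale=1.0, size=60)

# p is calculated FROM the data
t_stat, p_value = stats.ttest_ind(treatment, control)

# The entire decision rule: did the jump (p) clear the bar (alpha)?
verdict = "significant" if p_value < alpha else "not significant"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {verdict}")

# The same bar expressed as a critical value instead of a probability
df = len(treatment) + len(control) - 2
critical_t = stats.t.ppf(1 - alpha / 2, df)   # two-sided threshold
print(f"critical |t| at alpha = {alpha}: {critical_t:.2f}")
```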
Who Decided 0.05? — The Fisher Origin Story
Ronald Fisher (1925) — The One Sentence That Changed Everything
In Statistical Methods for Research Workers (1925), Fisher wrote:
"The value for which P = .05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."
That's it. One sentence. One word: "convenient."
Not "optimal." Not "mathematically derived." Not "the only valid choice." Not "proven to minimise harm." Convenient.
Fisher was writing a handbook for agricultural researchers at Rothamsted Experimental Station. They were testing fertilisers on crops. He needed to give them a practical rule for deciding when a crop yield difference was "real" vs "just weather and soil variation."
He picked 1 in 20 (5%) because:
- It was a round number in the framework of the normal distribution (roughly ±2 SD)
- It felt like a reasonable false alarm rate — being wrong 1 time in 20 seemed tolerable for farming decisions
- It was easy to compute with the tables available in the 1920s (no computers)
Fisher never intended 0.05 to become a universal, rigid, career-determining threshold. He explicitly said p-values should be interpreted as continuous measures of evidence, not dichotomised at any fixed threshold.
In 1956, Fisher wrote: "No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas."
The man who "invented" 0.05 spent the last decades of his life arguing AGAINST using it as a fixed threshold. Nobody listened.
Why Did 0.05 Stick?
Neyman and Pearson's framework needed a fixed α to control long-run error rates. They adopted Fisher's 0.05 not because they agreed with his reasoning, but because it was already conventional.
Textbooks codified it. Once 0.05 appeared in every statistics textbook, it became self-reinforcing. Students learned it as a rule. Reviewers enforced it. Journals required it. Nobody asked why.
Journal editors weaponised it. By the 1960s, "p < 0.05" became the de facto requirement for publication. Results with p = 0.06 were "negative" and unpublishable. Results with p = 0.04 were "positive" and publishable. A binary sorting system for scientific discovery was created from one man's offhand remark about "convenience."
The Alternatives That Existed
| Threshold | Who Advocated | Rationale |
|---|---|---|
| 0.05 | Fisher (1925) | "Convenient" for agricultural research |
| 0.01 | Common in physics, particle physics uses 5σ (~0.0000003) | Physical sciences demand higher certainty |
| 0.005 | Benjamin et al. (2018), 72 researchers | Proposed to reduce false positives in social/medical sciences |
| 0.10 | Some epidemiology, screening studies | When missing a real effect (Type II) is costlier than a false alarm |
| No fixed threshold | Fisher (later career), Bayesians, ASA (2016, 2019) | p-values should be continuous, not dichotomised |
The 2018 proposal to redefine significance as p < 0.005 (published in Nature Human Behaviour, signed by 72 prominent researchers) argued that the false positive rate at p < 0.05 is unacceptably high — especially when the prior probability of the hypothesis being true is low. Their analysis suggested that a "significant" result at p ≈ 0.04 carries a false positive probability of 30-50% in many realistic scenarios (depending on prior probability and power).
Nothing happened. The field acknowledged the argument and continued using 0.05. The inertia of 100 years of convention is nearly impossible to overcome.
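A back-of-envelope version of that argument, in the spirit of the proposal rather than a reproduction of it. The 10% prior probability and 80% power below are illustrative assumptions:

```python
# Probability that a "significant" result is a false positive,
# given a prior probability that the hypothesis is true and the study's power.
def false_positive_risk(alpha, power, prior_true):
    """P(H0 true | result declared significant), by Bayes' theorem."""
    p_sig_given_false = alpha * (1 - prior_true)   # false positives
    p_sig_given_true = power * prior_true          # true positives
    return p_sig_given_false / (p_sig_given_false + p_sig_given_true)

# Illustrative assumptions: 1 in 10 tested hypotheses is actually true, 80% power
for alpha in (0.05, 0.005):
    risk = false_positive_risk(alpha=alpha, power=0.80, prior_true=0.10)
    print(f"alpha = {alpha}: ~{risk:.0%} of 'significant' findings are false positives")
# alpha = 0.05  -> roughly 36%
# alpha = 0.005 -> roughly 5%
```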
Why Is "Significance" Confusing? — The Three-Way Collision
Collision 1: Statistical Significance ≠ Clinical Significance
This is the single most dangerous confusion in medical statistics.
| Type | Question It Answers | Example |
|---|---|---|
| Statistical significance | "Is the effect unlikely to be zero?" | p = 0.001 for a 0.3 mmHg BP reduction |
| Clinical significance | "Is the effect large enough to matter to patients?" | 5+ mmHg BP reduction = clinically meaningful |
| Practical significance | "Is the effect worth the cost, risk, and effort?" | Does the benefit justify the drug's price and side effects? |
A result can be:
- Statistically significant but clinically insignificant (large study, tiny effect)
- Clinically significant but statistically insignificant (small study, real but undetected effect)
- Both (ideal — large study, meaningful effect, p < 0.05)
- Neither (large study confirming truly no effect)
When a paper says "significant reduction in mortality," your brain reads "IMPORTANT reduction in mortality." But the paper might mean "a reduction we're confident isn't zero, even though it's only 0.1%." The word "significant" carries clinical weight it was never designed to bear.
Collision 2: "Level" Implies a Measured Quantity
In clinical medicine, "level" usually means a measured value: "serum creatinine level," "haemoglobin level." So students naturally read "significance level" the same way and interpret "level of significance" as "the AMOUNT of significance" — as if significance comes in quantities, like creatinine.
It doesn't. The "level" is not a measurement OF significance. It's the THRESHOLD FOR significance. It's the bar, not the jump.
"What is the level of significance of your study?" could mean:
- "What α did you use?" (correct — asking about the threshold)
- "How significant are your results?" (incorrect conflation — asking about the p-value or effect size)
Students commonly answer the second interpretation when asked the first.
Collision 3: α Is Chosen BEFORE, p Is Calculated AFTER
| | α (Level of Significance) | p-value |
|---|---|---|
| When determined | BEFORE data collection | AFTER data analysis |
| Who chooses it | The researcher (convention: 0.05) | The data (calculated from the test) |
| What it represents | Maximum acceptable false positive rate | Actual probability of the observed result under H0 |
| Fixed or variable | Fixed for the study | Changes with each analysis |
The confusion: Students use "significance level" and "p-value" interchangeably. They're not the same thing. α is the BAR. p is the JUMP. "Did the jump clear the bar?" is the question. The bar (α) doesn't change when you look at the jump (p). The jump (p) doesn't set the bar (α).
The Deeper History — Before Fisher
Word Surgery: "Significant" in Science Before Statistics
The word "significant" was used in science long before Fisher. Pierre-Simon Laplace (1749-1827) used the French significatif to describe results that were "meaningful" or "noteworthy" in his probability calculations.
Karl Pearson (1900) used "significant" in his chi-squared paper but without a fixed threshold. He meant: "the deviation is large enough to deserve attention."
William Gosset / "Student" (1908) used phrases like "the difference is significant" to mean "the difference appears real" in his brewery experiments.
Fisher formalised what was already an informal practice. Before Fisher, "significant" was a qualitative judgment by the researcher. Fisher made it quantitative by tying it to a probability threshold. The gain was standardisation. The loss was nuance.
The Bayesian Critique — Why "Level of Significance" Is Philosophically Incoherent
Thomas Bayes (1701-1761) and his intellectual descendants argue that the entire framework of significance testing is backwards.
The Bayesian objection:
What you want to know: "Given my data, what is the probability that the drug works?" → P(H1 | data)
What the p-value tells you: "Given that the drug doesn't work (H0), what is the probability of getting data this extreme?" → P(data | H0)
These are not the same thing. The probability of the data given the hypothesis is NOT the probability of the hypothesis given the data. Confusing them is called the prosecutor's fallacy (or the transposition fallacy).
Word Surgery: "Prosecutor's Fallacy"
Why this name? Because prosecutors make this exact error in court:
- "The probability of finding this DNA at the scene if the defendant is innocent is 1 in a million" (correct — this is like the p-value)
- "Therefore, the probability that the defendant is innocent is 1 in a million" (WRONG — this transposes the conditional probability)
Similarly:
- "The probability of seeing this data if H0 is true is 0.03" (correct — this IS the p-value)
- "Therefore, the probability that H0 is true is 0.03" (WRONG — but this is how most people interpret p = 0.03)
The level of significance framework lets you control how often you make false positive DECISIONS. It does NOT tell you the probability that any specific decision is wrong. This is a fundamental limitation that no choice of α can fix — it's built into the frequentist architecture.
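To see how badly the transposed statement can mislead, here is the courtroom example above with hypothetical numbers (the city of five million people and the single true perpetrator are assumptions for illustration):

```python
# Hypothetical numbers: a city of 5,000,000 people, one true perpetrator,
# and a DNA profile that an innocent person matches with probability 1 in 1,000,000.
population = 5_000_000
p_match_given_innocent = 1 / 1_000_000      # the "p-value"-like quantity
innocents = population - 1

expected_innocent_matches = innocents * p_match_given_innocent   # about 5 people

# Bayes: among everyone who matches, what fraction is innocent?
p_innocent_given_match = expected_innocent_matches / (expected_innocent_matches + 1)
print(f"P(match | innocent)  = {p_match_given_innocent:.7f}")
print(f"P(innocent | match) ~= {p_innocent_given_match:.0%}")   # roughly 83%, not 1 in a million
```

The 1-in-a-million figure is true, yet on this evidence alone the defendant is more likely innocent than guilty. A p-value suffers exactly the same transposition whenever it is read as "the probability that H0 is true."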
The Regulatory Dimension
Why FDA Chose α = 0.05 (Two-Sided)
ICH E9 Section 5.5: "It is important that the level of significance of the test should be specified in advance. Two-sided tests at the 5% significance level are recommended."
Why?
1. Historical convention. FDA adopted what the statistical community already used. The 0.05 standard was entrenched by the 1960s when modern drug regulation took shape.
2. Balancing false positives and feasibility. α = 0.01 would require much larger (more expensive, longer) trials. α = 0.10 would let too many ineffective drugs through. 0.05 was the pragmatic compromise.
3. Two-trial multiplication. With two independent pivotal trials each at α = 0.05: combined false positive rate = 0.05 × 0.05 = 0.0025, a 1-in-400 chance of approving an ineffective drug on the back of two flukes. (Requiring both trials to favour the drug in the same direction is stricter still: roughly 0.025 × 0.025, about 1 in 1600.)
When FDA Uses Different α Levels
| Scenario | α Used | Why |
|---|---|---|
| Standard pivotal trial | 0.05 (two-sided) | Convention. ICH E9 recommendation. |
| Bioequivalence | 0.05 (but expressed as 90% CI within 80-125%) | Two one-sided tests each at α = 0.05 → 90% CI |
| Non-inferiority | 0.025 (one-sided) = 0.05 (two-sided equivalent) | Tests only one direction (not worse than standard) |
| Interim analysis (first look) | 0.001 - 0.005 (O'Brien-Fleming) | Alpha-spending preserves overall α = 0.05 across multiple looks |
| Rare disease / breakthrough | Sometimes 0.05 one-sided (= 0.10 two-sided equivalent) | Regulatory flexibility for unmet need — accepting higher false positive risk |
| Accelerated approval (surrogate) | 0.05, but on surrogate endpoint | The α is standard but applied to a SURROGATE, which introduces a different kind of uncertainty |
The α and Drug Pricing Connection
Here's something nobody teaches:
The CHOICE of α has economic consequences worth billions.
- Stricter α (say 0.01) → larger trials needed → higher development cost → higher drug price → fewer drugs developed for rare diseases (can't recruit enough patients)
- Looser α (say 0.10) → more drugs approved → more false positives → more post-market withdrawals → public trust eroded → regulatory credibility damaged
α = 0.05 is not just a statistical choice. It's an economic and ethical equilibrium point that balances:
- Patient protection (low α → fewer false approvals)
- Innovation incentive (achievable α → feasible trials)
- Cost containment (moderate α → moderate trial sizes)
- Rare disease access (α can't be so strict that rare disease trials become impossible)
FDA's "regulatory flexibility" for rare diseases is essentially α-flexibility. They accept higher false positive risk (effectively higher α) because the alternative — demanding massive trials in diseases with 500 patients worldwide — would ensure no rare disease drug ever gets developed.
Post-Market Surveillance — When α = 0.05 Fails
α = 0.05 protects against false positives at the APPROVAL stage. But:
- Safety signals from post-market surveillance use different (often lower) significance thresholds
- The FDA Sentinel system monitors millions of electronic health records using sequential testing methods that control α across continuous monitoring
- REMS (Risk Evaluation and Mitigation Strategies) are imposed when post-market data suggests the α = 0.05 approval may have let through a safety problem
The level of significance at approval is not the last word. It's the first filter. Post-market surveillance is the second. Withdrawal from market (Vioxx, Relyvrio) is the correction when both filters fail.
Branch-by-Branch — Where the Level of Significance Bites You
General Medicine
The scenario: Two diabetes drug trials.
Trial A: HbA1c reduction = 0.8%, p = 0.001, n = 2000
Trial B: HbA1c reduction = 1.2%, p = 0.07, n = 45
Which drug is "better"?
A naïve reading: "Trial A is significant. Trial B is not. Drug A works. Drug B doesn't."
What α = 0.05 hid: Drug B has a LARGER clinical effect (1.2% vs 0.8%) but was tested in a tiny sample that couldn't reach significance. Drug A has a SMALLER effect that reached significance because the massive sample size gave it enough power.
α = 0.05 is a filter for noise, not a filter for importance. Drug B might genuinely be better. It just couldn't prove it with n = 45.
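A quick sketch of why sample size, not effect size, decided those two verdicts. The SD of 2.2 percentage points is an assumption chosen to put the small trial near p = 0.07; the point is the pattern, not the exact numbers from the scenario:

```python
import numpy as np
from scipy import stats

def two_sample_p(diff, sd, n_per_arm):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sd * np.sqrt(2 / n_per_arm)
    return 2 * stats.norm.sf(abs(diff / se))

sd = 2.2   # assumed SD of HbA1c change, in percentage points
print(f"1.2% reduction, 22 per arm:   p = {two_sample_p(1.2, sd, 22):.3f}")    # about 0.07
print(f"1.2% reduction, 80 per arm:   p = {two_sample_p(1.2, sd, 80):.4f}")    # about 0.0006
print(f"0.8% reduction, 1000 per arm: p = {two_sample_p(0.8, sd, 1000):.1e}")  # vanishingly small
# The 1.2% effect is the same drug either way; only n decides which side of 0.05 it lands on.
```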
Surgery
The scenario: A meta-analysis of laparoscopic vs open appendectomy. 47 studies. Some use α = 0.05, some use α = 0.01 (different countries, different traditions). Some pre-specified one-sided tests. Some pre-specified two-sided.
The results are a mess. Studies with similar data reach opposite conclusions because they used different significance thresholds.
The trap: When combining studies with different α levels in a meta-analysis, the binary "significant/not significant" labels are meaningless. What matters is the EFFECT SIZE and CI from each study, not whether it cleared its local α bar. This is why modern meta-analysis uses effect estimates and CIs, not vote-counting of "significant" vs "not significant" studies.
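A minimal sketch of "pool the effect estimates, not the verdicts": fixed-effect inverse-variance weighting over made-up study estimates (say, log odds ratios); none of these numbers comes from the real appendectomy literature:

```python
import numpy as np
from scipy import stats

# Hypothetical per-study effect estimates (log odds ratios) and standard errors
effects = np.array([-0.35, -0.10, -0.42, -0.05, -0.28])
ses = np.array([0.20, 0.25, 0.30, 0.15, 0.22])

# Fixed-effect inverse-variance pooling
weights = 1 / ses**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
p = 2 * stats.norm.sf(abs(pooled / pooled_se))

print(f"pooled effect = {pooled:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}), p = {p:.3f}")
# None of these hypothetical studies is "significant" on its own; vote-counting the
# verdicts would discard the information that the pooled estimate and CI keep.
```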
Paediatrics
The scenario: Paediatric clinical pharmacology. A drug is tested in adults (n = 800, p = 0.001) and in children (n = 35, p = 0.12).
The drug "works" in adults and "doesn't work" in children.
Or: The drug works in both, but the paediatric trial was hopelessly underpowered. The α = 0.05 bar is the same height for both trials, but the paediatric study had legs too short to jump it (insufficient power from small n).
This creates a systemic bias against paediatric treatments. Ethical constraints → small samples → underpowered trials → "non-significant" results → drugs not approved for children → off-label prescribing without evidence.
The level of significance doesn't adjust for the difficulty of the trial. α = 0.05 is the same bar for a 2000-patient adult trial and a 35-patient paediatric trial. But clearing that bar with 35 patients requires a MUCH larger effect size. If the drug has a moderate effect (which most drugs do), the paediatric trial will "fail" at α = 0.05 even though the drug works.
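A sketch of how high the same α = 0.05 bar effectively sits at different sample sizes: the smallest standardised effect detectable with 80% power, using the standard normal-approximation formula for a two-arm trial:

```python
import numpy as np
from scipy import stats

def min_detectable_effect(n_per_arm, alpha=0.05, power=0.80):
    """Smallest standardised difference a two-arm trial can detect
    with the given power (normal approximation, two-sided test)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return (z_alpha + z_power) * np.sqrt(2 / n_per_arm)

for total_n in (35, 200, 800, 2000):
    d = min_detectable_effect(n_per_arm=total_n // 2)
    print(f"total n = {total_n:>4}: detectable effect ~ {d:.2f} SD")
# total n =   35 -> roughly 0.96 SD (a huge effect; most real drugs are far smaller)
# total n = 2000 -> roughly 0.13 SD
```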
Obstetrics
The scenario: A trial of cervical cerclage for preventing preterm birth. p = 0.049.
Published as: "Significant reduction in preterm birth (p < 0.05)."
What actually happened: The original statistical analysis plan (SAP) specified a two-sided test at α = 0.05. But midway through the trial, the data safety monitoring board (DSMB) performed an interim analysis. No alpha-spending adjustment was made.
The "final" α is not actually 0.05 — it's inflated because of the unaccounted interim look. The true α might be 0.07 or 0.08. The p = 0.049 that "just cleared" α = 0.05 may not have cleared the REAL threshold.
The trap: The "level of significance" in the paper says α = 0.05. The ACTUAL level after accounting for interim analyses is higher. The paper is reporting against the wrong bar. This is a multiplicity / alpha-spending problem disguised as a straightforward significance test.
Psychiatry
The scenario: FDA's review of a new antipsychotic. Two pivotal trials.
Trial 1: p = 0.03 on PANSS total score. "Significant."
Trial 2: p = 0.04 on PANSS total score. "Significant."
Both clear α = 0.05. Both clear the two-trial rule. The drug should be approved.
But the FDA statistical reviewer notes:
Both trials had 4 dose groups + placebo (5 arms). Neither trial adjusted α for the 4 pairwise comparisons (each dose vs placebo). The pre-specified primary analysis was "best dose vs placebo" — but "best dose" was selected AFTER unblinding.
The effective α per comparison is 0.05/4 = 0.0125 (Bonferroni) or somewhere between 0.0125 and 0.05 (depending on the multiplicity method). Neither p = 0.03 nor p = 0.04 clears the multiplicity-adjusted threshold.
The drug appears to clear α = 0.05 on the surface. It does NOT clear the appropriate multiplicity-adjusted α underneath. The "level of significance" reported in the abstract (0.05) is not the correct level for this design.
FDA issues a Complete Response Letter. The drug is not approved. The sponsor re-analyses with pre-specified dose selection and multiplicity correction. Two years and $50 million later, they resubmit.
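A sketch of why "best dose vs placebo, selected after unblinding" is not a test at α = 0.05. All arms are simulated under H0 (no dose does anything); the arm size is an illustrative assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n_per_arm, n_doses = 20_000, 100, 4

false_positives = 0
for _ in range(n_sims):
    placebo = rng.normal(size=n_per_arm)
    # Four dose arms drawn from the SAME distribution as placebo (H0 true)
    doses = rng.normal(size=(n_doses, n_per_arm))
    pvals = [stats.ttest_ind(dose, placebo).pvalue for dose in doses]
    if min(pvals) < 0.05:          # report whichever comparison looks best
        false_positives += 1

print(f"P(best dose 'significant' | no dose works) ~ {false_positives / n_sims:.2f}")
# roughly 0.15 in this setup, about three times the nominal 0.05;
# Bonferroni would demand p < 0.05/4 = 0.0125 for each dose comparison
```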
Community Medicine / PSM
The scenario: India's National Health Survey tests associations between 200 risk factors and 50 health outcomes. That's 10,000 potential tests.
If none of the 10,000 associations were real, you would expect about 500 "significant" results (10,000 × 0.05) purely by chance.
The survey reports 600 "significant associations." How many are real? Unknown. But if most of the tested associations are truly null (a safe bet in an exploratory survey), several hundred of the 600 are expected to be false positives. The "real" findings are drowned in noise.
The trap: The level of significance (α = 0.05) was designed for a SINGLE pre-specified test. When you run 10,000 tests, α = 0.05 PER TEST gives you a FAMILY-wise error rate of nearly 100%. You're virtually guaranteed to find "significant" results that are pure noise.
Without Bonferroni (α = 0.05/10,000 = 0.000005) or FDR correction, the "level of significance" is meaningless in a large-scale survey. But most survey papers don't adjust. They report hundreds of "significant" associations at α = 0.05 per test and let policy-makers treat them as established facts.
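A sketch of that arithmetic, using a simulated survey in which only 100 of the 10,000 tested associations are real (every number here is an assumption for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_real = 10_000, 100          # assume only 100 of 10,000 associations are real

# Simulated z-statistics: null associations centred at 0, real ones centred at z = 4
z = np.concatenate([rng.normal(0, 1, n_tests - n_real), rng.normal(4, 1, n_real)])
p = 2 * stats.norm.sf(np.abs(z))
is_real = np.concatenate([np.zeros(n_tests - n_real, bool), np.ones(n_real, bool)])

raw_hits = p < 0.05
bonf_hits = p < 0.05 / n_tests          # Bonferroni: 0.000005 per test

print(f"unadjusted alpha = 0.05: {raw_hits.sum()} 'significant', "
      f"{(raw_hits & ~is_real).sum()} of them false")
print(f"Bonferroni:              {bonf_hits.sum()} 'significant', "
      f"{(bonf_hits & ~is_real).sum()} of them false")
# Benjamini-Hochberg FDR control is the usual middle ground between these two extremes.
```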
Orthopaedics
The scenario: An orthopaedic journal requires p < 0.05 for publication. A study comparing two implants has p = 0.08 with a clinically meaningful 5-point difference in functional scores.
The paper is rejected. "Not significant."
The same study, with 20 more patients (same effect size), would have p = 0.03. "Significant." Published.
α = 0.05 as a publication threshold creates a literature biased toward large studies and against small ones. Studies with meaningful effects but insufficient sample sizes are systematically unpublished. This is publication bias — and the rigid application of α = 0.05 as a publication filter is its primary engine.
The consequence: if you do a meta-analysis of published studies only, you overestimate the treatment effect because negative/inconclusive studies were never published. The "level of significance" intended to protect against false positives has created a systematic bias that inflates false positives in the aggregate literature.
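A simulation of that inflation. The true effect, the per-study sample size, and the "publish only if p < 0.05" rule are all illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n_studies, n_per_arm, true_diff, sd = 2_000, 30, 3.0, 12.0   # a real but modest effect

se = sd * np.sqrt(2 / n_per_arm)
estimates = rng.normal(true_diff, se, n_studies)      # each study's observed difference
p = 2 * stats.norm.sf(np.abs(estimates) / se)

published = p < 0.05                                  # the journal's alpha = 0.05 filter
print(f"true effect:                   {true_diff:.1f}")
print(f"mean effect, all studies:      {estimates.mean():.1f}")
print(f"mean effect, published only:   {estimates[published].mean():.1f}")
print(f"fraction of studies published: {published.mean():.0%}")
# In this setup the published subset overstates the true effect by more than double.
```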
The 6 Ways Not Knowing α Destroys You
1. You treat 0.05 as a law of nature
It's not. It's Fisher's offhand remark about convenience. Different fields use different thresholds. Different regulatory contexts use different thresholds. Even within medicine, bioequivalence uses a 90% CI built from two one-sided tests at α = 0.05 each, and non-inferiority uses one-sided 0.025. The "standard" is not standard.
2. You treat p = 0.049 and p = 0.051 as fundamentally different
They're not. p = 0.049 and p = 0.051 represent almost identical evidence. The data doesn't know about your threshold. The difference of 0.002 is noise. Yet one gets published and the other doesn't, one gets funded and the other doesn't, one changes practice and the other is forgotten. This is the tyranny of the threshold.
3. You don't pre-specify α and manipulate it post-hoc
If your primary analysis gives p = 0.07 two-sided and you switch to a one-sided test, the p-value halves to 0.035 and your result "becomes significant." But a one-sided test at α = 0.05 corresponds to a two-sided α of 0.10, so you have quietly doubled your tolerance for false positives. This is p-hacking. Pre-specification of α, hypothesis direction, and analysis plan BEFORE data collection is the ethical and regulatory requirement.
4. You don't adjust α for multiple comparisons
α = 0.05 per test with 20 tests = 64% chance of at least one false positive. The "level of significance" is only meaningful for a SINGLE pre-specified test. Every additional test inflates the family-wise error rate unless you adjust.
5. You confuse statistical significance with clinical significance
A study with p = 0.0001 for a 0.2 mmHg BP reduction is "highly significant" statistically and completely irrelevant clinically. α controls the false POSITIVE rate. It says nothing about whether the positive finding is meaningful, actionable, or worth the drug's cost and side effects.
6. You can't evaluate why different guidelines cite different evidence thresholds
ACC/AHA guidelines use "Level of Evidence A" (multiple RCTs) differently from NICE's "High quality evidence." When guidelines cite trials that "achieved significance," understanding what α was used, whether multiplicity was controlled, and whether the significance was clinically meaningful is essential for evaluating the recommendation's strength. Without understanding α, guidelines are dogma, not evidence.
The One Thing to Remember
The level of significance is a line in the sand. On one side: "we'll call this real." On the other side: "we'll call this noise."
That line was drawn by one man in 1925 because it was "convenient." It has never been proven to be optimal, correct, or the best possible threshold for medicine. But it has controlled publication, funding, drug approval, and clinical practice for 100 years.
α = 0.05 is not wrong. It's not right. It's a social contract — an agreement between researchers, journals, regulators, and clinicians about how much false alarm risk we'll tolerate. Like all contracts, it's renegotiable. Like all contracts, it works only when everyone understands what they're agreeing to.
The resident who understands α sees p = 0.03 and asks: "What was the pre-specified α? Was it adjusted for multiple comparisons? Is the effect clinically meaningful? What was the power?" Four questions that transform a number into a judgment.
The resident who doesn't understand α sees p = 0.03 and says: "Significant." One word that stops thought where it should begin.
Fisher would be horrified. He wanted you to think. The threshold was supposed to start the conversation, not end it.