Stateazy · 26 min read · 12 April 2026

What Does "p = 0.03" Actually Mean? (Not What You Think.)

The 6 ways medical residents misinterpret p-values, the ASA's 2016 statement, and why Fisher's Lady Tasting Tea started it all.

Stateazy Series


The Problem First

You're a dermatology resident. Journal club. The paper says:

"Patients treated with Drug X showed significant improvement in PASI-75 response compared to placebo (62% vs 34%, p = 0.03)."

Your professor asks: "What does p = 0.03 mean?"

You say: "There's a 3% chance the drug doesn't work."

Wrong.

Your friend says: "There's a 3% chance the results are due to chance."

Also wrong. (But closer.)

The senior resident tries: "There's a 97% probability the drug works."

Completely wrong. That's 1 minus the p-value, and that's not how any of this works.

Three residents. Three different wrong answers. All confident. All intuitive. All incorrect.

The p-value is the most reported, most cited, most misunderstood number in all of medical science. It appears in every paper you'll ever read. It decides which drugs get approved. It decides which papers get published. It decides which treatments your patients receive.

And almost nobody interprets it correctly.


Before the Term — What Problem Are We Solving?

You ran a study. Your drug group did better than placebo. But you know that random variation exists — even if the drug is useless, the drug group might look better just by luck. Some patients randomly recover. Some randomly get worse. The dice of biology don't land evenly.

So the question is:

"Could the difference I'm seeing have appeared by pure dumb luck, even if the drug does absolutely nothing?"

The p-value answers this question. But it answers it in a very specific, very narrow, very counterintuitive way. And the narrowness of the answer is exactly where everyone gets confused.


Word Surgery: "p-value"

"p"

What it stands for: "probability"

Root: Latin probabilis = "worthy of approval" / "likely to be proved" → From probare = "to test, to prove, to approve" → From probus = "good, honest, upright"

Literal meaning: "the degree to which something can be proved or tested"

Aha: "Probability" literally = "prove-ability." How provable is something? How testable? The p in p-value stands for "the probability of..." — but the probability of WHAT? That's where everyone gets confused.

"Value"

Root: Latin valere = "to be strong, to be worth" → Old French value = "worth, price"

Literal meaning: "the numerical worth" → just "the number"

So "p-value" literally = "the probability number." A single number that captures a specific probability. Simple name. Deceptively complex concept.

Why Not "Probability Value" in Full?

Because Fisher and his contemporaries used "P" as shorthand in their statistical tables. Fisher's 1925 Statistical Methods for Research Workers uses "P" throughout — capital P, no "value" attached. The full phrase "p-value" emerged gradually in textbook usage. Fisher just wrote P = 0.03. The word "value" was appended later for clarity, the way "DNA molecule" expanded what was originally just "DNA."

Naming Family

| Term | What It Is | Relationship to p-value |
|---|---|---|
| p-value | Probability of data this extreme, given H0 is true | THE number |
| α (alpha) | Pre-set threshold (usually 0.05) | The BAR the p-value must limbo under |
| Significance | The verdict when p < α | The LABEL applied after comparing p to α |
| Test statistic (t, z, χ², F) | The intermediate calculation | p-value is derived FROM this |
| Critical value | Test statistic threshold corresponding to α | Same bar as α, expressed in different units |
| Effect size | HOW MUCH the groups differ | What p-value doesn't tell you |

The relationship in one sentence: The test statistic is calculated from your data, converted to a p-value, and compared against α. If p < α → "significant." The p-value is the bridge between raw data and the binary verdict.


The Actual Definition — Read This Three Times

Here is the correct definition:

The p-value is the probability of obtaining results as extreme as, or more extreme than, the observed results, ASSUMING that the null hypothesis is true.

Every word matters. Let's dissect it:

| Phrase | What It Actually Means |
|---|---|
| "the probability" | A number between 0 and 1 |
| "of obtaining results" | If you repeated the experiment |
| "as extreme as, or more extreme than" | Equal to OR further from what H0 predicts |
| "the observed results" | What your actual data showed |
| "assuming that the null hypothesis is true" | IN A WORLD WHERE THE DRUG DOES NOTHING |

The p-value lives in an imaginary world where the drug doesn't work. It asks: "In that imaginary world, how surprising would my data be?"

It does NOT say anything about the REAL world. It doesn't tell you the drug works. It doesn't tell you the probability that H0 is true. It doesn't tell you the probability that H1 is true. It tells you how weird your data would look IF H0 were true.

The Courtroom Analogy

Think of a murder trial:

  • H0 (null hypothesis) = "The defendant is innocent"
  • The data = the evidence presented (DNA, witnesses, motive)
  • The p-value = "If this person IS innocent, how likely is it that all this evidence would exist?"

A low p-value (say 0.01) means: "If the defendant is truly innocent, the chance of seeing evidence THIS damning is only 1%. That's suspicious enough to convict."

A high p-value (say 0.40) means: "If the defendant is truly innocent, there's a 40% chance evidence like this would exist anyway. Not enough to convict."

What the p-value does NOT tell the court: "The probability that the defendant is guilty." That requires considering the PRIOR probability of guilt (how likely was this person to commit this crime before seeing the evidence?), the quality of alternative suspects, and other factors the p-value cannot capture.

This is the fundamental asymmetry: The p-value tells you how surprising the evidence is IF innocence is assumed. It does NOT tell you the probability of innocence or guilt.


Who Invented This? — Fisher's Gift and Curse

Ronald Fisher (1890–1962) — The Single Most Influential Statistician

Ronald Aylmer Fisher gave the world the p-value, ANOVA, maximum likelihood, the design of experiments, and the randomised controlled trial. He was also a committed eugenicist and denied the link between smoking and cancer. Brilliance and moral failure coexisted in one person.

The Lady Tasting Tea (1935)

Fisher's most famous example — and the clearest explanation of a p-value ever given:

The scenario: A lady claims she can tell whether milk was poured into the cup before or after the tea. Fisher presents her with 8 cups — 4 milk-first, 4 tea-first, in random order. She must identify all 8 correctly.

H0: She's guessing randomly (she has no real ability).

The question: If she IS guessing, what's the probability she gets all 8 right?

The math: The number of ways to choose 4 cups from 8 = C(8,4) = 70. Only 1 arrangement is perfectly correct. So the probability of getting all 8 right by guessing = 1/70 = 0.014.

The interpretation: If she HAS no ability (H0 true), the chance of her getting a perfect score is 1.4%. That's surprising enough to suspect she genuinely HAS the ability.

p = 0.014. Not "the probability she can taste the difference." Not "the probability the experiment worked." It's "the probability of a perfect score IF she's just guessing."
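If you want to check the arithmetic rather than take it on trust, the whole calculation fits in a few lines of Python. A minimal sketch using only the standard library:

```python
from math import comb

# Ways to choose which 4 of the 8 cups are "milk first"
total_arrangements = comb(8, 4)          # = 70
perfect_by_guessing = 1                  # only one arrangement matches the true pouring order
p_value = perfect_by_guessing / total_arrangements

print(f"P(all 8 correct | she is guessing) = 1/{total_arrangements} = {p_value:.3f}")
# -> P(all 8 correct | she is guessing) = 1/70 = 0.014
```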

What Fisher Actually Said About p-values

Fisher (1925):

"The value for which P = .05 ... is convenient to take as a limit in judging whether a deviation is to be considered significant or not."

Fisher (1956):

"No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas."

Fisher intended the p-value as a continuous measure of evidence. Small p = stronger evidence against H0. Very small p = very strong evidence. He did NOT intend it to be a binary switch. He did NOT intend "p < 0.05 = real, p ≥ 0.05 = not real." That binary usage came from the Neyman-Pearson framework and from the bastardised hybrid that modern statistics became.

Fisher vs. Neyman-Pearson on p-values

| | Fisher's View | Neyman-Pearson's View | What We Actually Do |
|---|---|---|---|
| p-value purpose | Continuous measure of evidence | Compared to fixed α for a binary decision | Both simultaneously (incoherently) |
| Interpretation | "How strongly does this evidence speak against H0?" | "Does p fall below α? Yes/no." | Report exact p AND declare "significant" or "not significant" |
| What it means | Smaller p = stronger evidence | p < α → reject H0, period | Report p = 0.001 as "highly significant" (Fisher's gradient) but also use α = 0.05 as a gate (Neyman-Pearson's binary) |

The hybrid we use is philosophically incoherent — but practically functional enough that medicine runs on it.


The 6 Misinterpretations — And Why Each Is Wrong

This is the most important section. If you learn nothing else, learn these.

Misinterpretation 1: "p = 0.03 means there's a 3% chance the drug doesn't work"

Why it's wrong: The p-value is calculated UNDER THE ASSUMPTION that H0 is true. It cannot simultaneously be the probability that H0 is true. You assumed H0 to compute the number. You can't then use the number to evaluate the assumption. That's circular.

What you'd need: Bayes' theorem. You'd need the PRIOR probability that the drug works (before seeing data), combined with the likelihood of the data, to get the POSTERIOR probability. The p-value gives you only the likelihood piece, not the full answer.

The damage: When a doctor tells a patient "there's only a 3% chance this drug doesn't work," they're massively overstating the evidence. The actual probability that H0 is true (given p = 0.03) depends on the prior probability and could easily be 20-30% for a novel, implausible drug.
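To see how the prior drives the answer, here is a rough Bayes' theorem sketch. The priors, the 5% false positive rate, and the 80% power are assumed illustrative numbers, and the calculation conditions on "the result was significant" rather than on p = 0.03 exactly:

```python
def prob_H0_given_significant(prior_H1, alpha=0.05, power=0.80):
    """P(H0 true | a significant result), via Bayes' theorem."""
    prior_H0 = 1 - prior_H1
    p_significant = prior_H0 * alpha + prior_H1 * power   # total probability of "p < alpha"
    return prior_H0 * alpha / p_significant

for prior_H1 in (0.50, 0.20, 0.05):   # plausible drug, long shot, implausible drug
    posterior_H0 = prob_H0_given_significant(prior_H1)
    print(f"prior P(drug works) = {prior_H1:.2f} -> P(H0 true | significant) = {posterior_H0:.2f}")
# prior 0.50 -> 0.06,  prior 0.20 -> 0.20,  prior 0.05 -> 0.54
```

The same "significant" result leaves anywhere from a 6% to a 54% chance that H0 is true, depending entirely on how plausible the drug was before the trial.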

Misinterpretation 2: "p = 0.03 means there's a 3% probability the results are due to chance"

Why it's wrong: Subtle but important. The p-value is NOT the probability that chance produced the results. It's the probability that chance would produce results THIS EXTREME OR MORE EXTREME. The "or more extreme" part matters — the p-value covers the entire tail of the distribution, not just the exact observed result.

Also: "Due to chance" is vague. ALL results have a chance component. Even if the drug works, the observed difference includes both the true effect AND random variation. The p-value asks about a specific scenario (H0 true), not about "chance" in general.

Misinterpretation 3: "p = 0.03 means there's a 97% probability the drug works"

Why it's wrong: This is 1 - p, which is NOT the probability of H1. This is the prosecutor's fallacy / transposition fallacy.

  • P(data | H0) ≠ P(H0 | data)
  • P(seeing this evidence | innocent) ≠ P(innocent | seeing this evidence)

The probability of the data given the hypothesis is NOT the probability of the hypothesis given the data. Swapping them is the most common logical error in statistical interpretation.

Misinterpretation 4: "p = 0.03 means the effect size is large / clinically important"

Why it's wrong: The p-value is a function of TWO things: effect size and sample size. A tiny, clinically irrelevant effect (0.1 mmHg BP reduction) becomes "highly significant" (p < 0.001) with a large enough sample (n = 50,000). A large, clinically important effect (15 mmHg BP reduction) can be "non-significant" (p = 0.12) with a tiny sample (n = 12).

The formula intuitively:

p-value ≈ f(effect size × √sample size)

Bigger effect → smaller p. Bigger sample → smaller p. The p-value conflates the two. You cannot untangle them from the p-value alone. You need the confidence interval or effect size to know whether the effect MATTERS.
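A quick sketch of that conflation, with assumed standard deviations and a normal approximation (not data from any real trial): the trivial effect with an enormous sample produces the smaller p-value.

```python
from math import sqrt
from scipy import stats

def two_sample_p(effect, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sd * sqrt(2 / n_per_group)      # standard error of the difference
    z = effect / se
    return 2 * stats.norm.sf(abs(z))

print(two_sample_p(effect=0.1, sd=10, n_per_group=500_000))  # trivial effect, enormous n -> p ≈ 6e-7
print(two_sample_p(effect=15,  sd=20, n_per_group=6))        # large effect, tiny n       -> p ≈ 0.19
```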

Misinterpretation 5: "p = 0.50 means there's no effect / the drug doesn't work"

Why it's wrong: A high p-value means the data are CONSISTENT with H0. It does NOT prove H0 is true. Absence of evidence is not evidence of absence.

A study with n = 8 and p = 0.50 might simply be too small to detect a real effect. The drug might work beautifully — the study just couldn't see it. This is a Type II error / insufficient power problem.

The critical distinction: "We found no significant difference" ≠ "We found that there is no difference." The first is about the study's ability to detect. The second is a claim about reality. The p-value supports the first statement but can NEVER support the second.
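A small simulation sketch of that trap (the effect size, SD, and sample sizes are assumed for illustration): the drug genuinely works, yet with n = 8 per arm most simulated studies still return p ≥ 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, sd, n_per_arm, n_studies = 1.0, 2.0, 8, 10_000

nonsignificant = 0
for _ in range(n_studies):
    drug    = rng.normal(true_effect, sd, n_per_arm)   # drug arm: real effect present
    placebo = rng.normal(0.0,         sd, n_per_arm)   # placebo arm: no effect
    if stats.ttest_ind(drug, placebo).pvalue >= 0.05:
        nonsignificant += 1

print(f"Real effect, yet p >= 0.05 in {nonsignificant / n_studies:.0%} of simulated studies")
# With these numbers power is only ~15%, so roughly 85% of studies report "no significant difference".
```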

Misinterpretation 6: "p = 0.001 is stronger evidence than p = 0.04"

Why it's partially wrong: Fisher would agree — he viewed p-values as a gradient of evidence. But the comparison is valid only within the same study design, sample size, and context.

  • p = 0.001 from a well-designed RCT with n = 500 IS strong evidence
  • p = 0.001 from a data-dredged observational study that tested 200 comparisons without multiplicity correction is essentially NOISE (expected number of false positives: 200 × 0.05 = 10 "significant" findings by chance)

The p-value is not a universal currency of evidence. Its meaning is local — specific to the study that produced it. Comparing p-values across studies with different designs, sample sizes, and multiplicity structures is comparing apples to ostriches.
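The data-dredging arithmetic is easy to verify by simulation. This sketch runs 200 comparisons in which nothing is real (assumed group sizes, pure noise) and counts the "significant" hits:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
false_positives = 0
for _ in range(200):                           # 200 comparisons, all truly null
    group_a = rng.normal(0, 1, 30)
    group_b = rng.normal(0, 1, 30)             # drawn from the identical population
    if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
        false_positives += 1

print(f"'Significant' findings out of 200 null comparisons: {false_positives}")
# Expected ≈ 200 × 0.05 = 10, and every one of them is noise.
```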


The ASA Statement — When the World's Statisticians Said "Enough"

In 2016, the American Statistical Association (ASA) took the unprecedented step of issuing a formal statement on p-values. In 2019, an entire issue of The American Statistician was devoted to "moving beyond p < 0.05." This is the statistical establishment saying: "We created this monster. We need to tame it."

The ASA's 6 Principles (2016)

  1. P-values can indicate how incompatible the data are with a specified statistical model. (This is what they DO.)
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. (Misinterpretations 1 and 2 above.)
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. (The anti-0.05 dogma statement.)
  4. Proper inference requires full reporting and transparency. (Report effect sizes, CIs, number of analyses, all results — not just the "significant" ones.)
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. (Misinterpretation 4.)
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. (The p-value needs context: study design, prior evidence, effect size, clinical relevance.)

The 2019 Follow-Up: "Retire Statistical Significance"

Over 800 signatories called for abandoning the term "statistically significant" entirely. Their argument:

  • The phrase creates a false dichotomy (significant/not significant)
  • It encourages the fallacy that p = 0.049 and p = 0.051 are fundamentally different
  • It promotes "bright-line" thinking that substitutes a threshold for judgment
  • It enables publication bias (journals only publish "significant" results)

Their proposed alternative: Report the point estimate, the confidence interval, and the exact p-value. Let the READER assess significance in context. Stop using "significant" as a binary label.

What happened: Almost nothing. Journals still require p < 0.05. Reviewers still reject papers with p > 0.05. The system is too entrenched. But the ASA statement now gives you intellectual ammunition to push back when someone reduces your research to a single threshold.


The Calculation — What's Actually Happening Under the Hood

You don't need to compute p-values by hand (software does it). But understanding the logic prevents you from being a p-value zombie who reports numbers without understanding them.

The Logic Chain

Step 1: Assume H₀ is true (drug does nothing)
  ↓
Step 2: Under H₀, what distribution would the test statistic follow? (t-distribution, chi-squared, normal, F, etc.)
  ↓
Step 3: Calculate the test statistic from your actual data (how many standard errors is your result from the null value?)
  ↓
Step 4: Find the area in the tail(s) of the distribution beyond your test statistic (this area = the p-value)
  ↓
Step 5: Compare p to α
  p < α → "reject H₀" → "statistically significant"
  p ≥ α → "fail to reject H₀" → "not statistically significant"

Visual Intuition

Imagine a bell curve centred at zero (the world where H0 is true, the drug does nothing).

Your data produced a test statistic of t = 2.3. That's 2.3 standard errors away from zero.

The p-value is the area in the tails beyond ±2.3. It represents: "What fraction of the bell curve is as far from zero as my result, or farther?"

If that area is small (say 2%) → your data is out in the tails → unusual under H0 → suspicious → "significant."

If that area is large (say 35%) → your data is near the middle → ordinary under H0 → not suspicious → "not significant."

The p-value is a measure of SURPRISE. How surprised should you be by your data, IF nothing is really happening?
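Steps 3-5 of the logic chain, in code. The t = 2.3 comes from the example above; the 20 degrees of freedom are an assumed value for illustration:

```python
from scipy import stats

t_observed, df = 2.3, 20
p_two_sided = 2 * stats.t.sf(t_observed, df)   # area beyond +2.3 plus the area beyond -2.3
print(f"p = {p_two_sided:.3f}")                # ≈ 0.032: about 3% of the null distribution
                                               # lies as far from zero as the observed result

alpha = 0.05
print("significant" if p_two_sided < alpha else "not significant")
```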

One-Sided vs Two-Sided p-values

| Type | What It Tests | When to Use | p-value is... |
|---|---|---|---|
| Two-sided | "Is there a difference in EITHER direction?" | Default in most clinical trials (ICH E9) | Area in BOTH tails |
| One-sided | "Is the effect in a SPECIFIC direction?" | Non-inferiority, some bioequivalence | Area in ONE tail |

Two-sided p-value = 2 × one-sided p-value (approximately, for symmetric distributions).

The trap: Switching from two-sided to one-sided HALVES your p-value. A "non-significant" two-sided p = 0.08 becomes a "significant" one-sided p = 0.04. This is why pre-specification matters — deciding after seeing the data whether to use one-sided or two-sided is p-hacking.
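The halving is mechanical, as this sketch shows with an assumed z statistic of 1.75 (effect in the pre-specified direction):

```python
from scipy import stats

z = 1.75                                   # assumed test statistic near the boundary
p_one_sided = stats.norm.sf(z)             # area in one tail
p_two_sided = 2 * stats.norm.sf(abs(z))    # area in both tails

print(f"one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
# one-sided p = 0.040 ("significant"), two-sided p = 0.080 ("not significant") -- same data
```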


Branch-by-Branch — Where the p-value Bites You

General Medicine

The scenario: A mega-trial (n = 15,000) tests a statin for primary prevention. LDL reduction: 2 mg/dL. p < 0.001.

"Highly significant!" says the drug rep.

The reality: With 15,000 patients, you have so much statistical power that even a 2 mg/dL difference — clinically meaningless — becomes "highly significant." The p-value screams "real effect!" The effect size whispers "who cares?"

The trap: p < 0.001 with a tiny effect size is a FEATURE of large sample sizes, not evidence of clinical importance. Always look at the confidence interval: if the CI for LDL reduction is 1.2 to 2.8 mg/dL, the ENTIRE range is clinically irrelevant. The "highly significant" p-value is telling the truth — the effect is real. But "real" and "useful" are different things.


Surgery

The scenario: A trial of robotic vs laparoscopic cholecystectomy. Operating time: robotic 62 min vs laparoscopic 58 min. p = 0.04.

Published: "Significantly longer operating time with robotic approach."

The question nobody asked: Is 4 minutes clinically meaningful? It's within the range of anaesthesia variation, surgical experience variation, and case complexity variation. p = 0.04 tells you the 4-minute difference is "real" (unlikely to be pure chance). It does NOT tell you the 4-minute difference matters to the patient, the surgeon, or the hospital budget.

The branch-specific trap: Surgery papers love reporting multiple operative metrics (blood loss, time, conversion rate, complication rate, length of stay). With 5-10 outcomes tested, the probability of at least one false positive at α = 0.05 is 23-40%. The "significant" p = 0.04 on operating time might be the one false positive in a sea of null results. Without multiplicity correction, you can't tell.
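The 23-40% figure is the family-wise error rate for independent tests, which you can check in a couple of lines:

```python
alpha = 0.05
for k in (5, 10):
    fwer = 1 - (1 - alpha) ** k            # P(at least one false positive in k independent tests)
    print(f"{k} outcomes tested -> {fwer:.0%} chance of at least one false positive")
# 5 outcomes -> 23%, 10 outcomes -> 40%
```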


Paediatrics

The scenario: A vaccine trial in children. Efficacy: 89% vs 65% (vaccine vs placebo). n = 28. p = 0.09.

"Not significant." The vaccine "doesn't work."

The reality: 89% vs 65% is a massive difference. The vaccine almost certainly works. But with only 28 children (ethical constraints on paediatric trials), the study is hopelessly underpowered. The p-value reflects the SAMPLE SIZE, not the drug's efficacy.

The devastating consequence: The vaccine is not approved for children. Paediatricians must use it off-label (without formal evidence), or children go unvaccinated while adults get the approved product. The p-value threshold designed to protect patients has, in this case, harmed them.

What should have been done: Report the effect size (89% vs 65%, a 24 percentage point difference) and the confidence interval (which would be wide but entirely above zero). Let the clinical and regulatory judgment incorporate the magnitude of the effect, not just whether p crossed an arbitrary line.


Obstetrics

The scenario: A trial of progesterone to prevent preterm birth. Subgroup analysis: p = 0.01 in women with cervical length < 25 mm. p = 0.45 in women with cervical length ≥ 25 mm.

Conclusion: "Progesterone works in women with short cervix but not in women with normal cervix."

The trap: This is a SUBGROUP analysis. The p-values within subgroups DO NOT tell you whether the subgroups are different from each other. To claim the drug works differently in subgroups, you need a test of INTERACTION (does cervical length modify the treatment effect?), not separate tests within each subgroup.

The statistical truth: It is entirely possible for a drug to have p = 0.01 in one subgroup and p = 0.45 in another, even when the TRUE effect is IDENTICAL in both subgroups. The difference in p-values could be entirely due to different sample sizes or different baseline rates in the subgroups.

The rule: "Significant in subgroup A but not significant in subgroup B" ≠ "The treatment effect is different between subgroups." You need a formal interaction test to claim differential effects.


Psychiatry

The scenario: An antidepressant trial. Hamilton Depression Rating Scale (HDRS) improvement: drug 12.3 points, placebo 10.1 points. Difference = 2.2 points. p = 0.01.

"Statistically significant improvement!"

The NICE criterion: A clinically meaningful HDRS improvement is ≥ 3 points (some argue ≥ 4). The observed difference is 2.2 points. The p-value is 0.01. The drug "works" statistically but doesn't meet the threshold for clinical meaningfulness.

The psychiatric trap: Antidepressant trials consistently show this pattern — statistically significant but clinically marginal effects. Meta-analyses (Kirsch et al., 2008) showed that the average drug-placebo difference for antidepressants is ~1.8 HDRS points, well below the clinical threshold, but statistically significant because of large sample sizes.

The implication: The p-value approves drugs that produce real but imperceptible effects. The patient takes a drug with side effects (weight gain, sexual dysfunction, withdrawal symptoms) for a 2-point improvement on a scale they can't feel. The p-value said "significant." The patient says "I don't feel any different."


Community Medicine / PSM

The scenario: A district health survey tests whether 15 risk factors are associated with childhood stunting. Results: 4 factors have p < 0.05.

Policy conclusion: "These 4 factors should be targeted for intervention."

The multiplicity problem: With 15 independent tests at α = 0.05, the expected number of false positives = 15 × 0.05 = 0.75. So roughly 1 of the 4 "significant" findings might be a false positive. Which one? You can't tell without adjusting for multiplicity.
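The arithmetic, plus the simplest fix (a Bonferroni-adjusted threshold), as a sketch assuming 15 independent tests:

```python
n_tests, alpha = 15, 0.05

expected_false_positives = n_tests * alpha               # ~0.75: roughly one of the 4 "hits" may be noise
prob_at_least_one = 1 - (1 - alpha) ** n_tests           # chance of at least one false alarm
bonferroni_threshold = alpha / n_tests                   # each test must now clear p < 0.0033

print(f"expected false positives = {expected_false_positives:.2f}, "
      f"P(>=1 false positive) = {prob_at_least_one:.0%}, "
      f"Bonferroni threshold = {bonferroni_threshold:.4f}")
# expected false positives = 0.75, P(>=1 false positive) = 54%, Bonferroni threshold = 0.0033
```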

The policy damage: Resources are allocated to address a risk factor that was a statistical false alarm. The REAL risk factors that had p = 0.07 or p = 0.09 (true effects that the study was underpowered to detect) are ignored. The p-value directed policy toward noise and away from signal.


Orthopaedics

The scenario: A trial of PRP (platelet-rich plasma) vs corticosteroid for lateral epicondylitis. VAS pain score at 6 months: PRP = 2.1, steroid = 3.4. Difference = 1.3. p = 0.03.

Published: "PRP is significantly superior to corticosteroid."

At 12 months: PRP = 2.0, steroid = 2.3. Difference = 0.3. p = 0.55.

Not published (the 12-month data appeared in a supplementary table).

The selective reporting trap: The abstract reports the 6-month p-value. The 12-month data — which shows the effect has essentially vanished — is buried. The p-value at one timepoint became the story. The p-value at another timepoint was hidden.

This is the most common form of p-value abuse in orthopaedics: multiple timepoints, multiple outcomes, reporting only the ones that "worked." Each unreported comparison is a hidden multiplicity problem.


Radiology / Diagnostics

The scenario: A new AI algorithm detects breast cancer on mammography. Sensitivity: AI = 94.2%, Radiologist = 93.8%. p = 0.04.

"AI is significantly better than radiologists at detecting breast cancer."

The reality: A 0.4 percentage point difference in sensitivity. In a screening population of 10,000 women with a cancer prevalence of 5/1000 (about 50 cancers per 10,000 screens), this means AI catches roughly 0.2 additional cancers per 10,000 screens, or about one extra detection per 50,000 screens. The p = 0.04 was achievable because the study validated on 200,000 images.

The question the p-value can't answer: Is one additional detection per 50,000 screens worth the infrastructure cost of AI deployment, the medicolegal uncertainty, the loss of radiologist expertise, and the potential increase in false positives?

The p-value said "the difference is real." It said nothing about whether the difference is worth acting on.


The 6 Ways Not Knowing p-values Destroys You

1. You worship p < 0.05 without checking effect size

You prescribe a drug that "works" (p < 0.05) but produces a clinically imperceptible improvement. Your patient gets side effects for nothing. The p-value was technically correct. Your clinical decision was wrong.

2. You dismiss p = 0.07 as "no effect"

A small study shows a promising effect (15% mortality reduction) with p = 0.07. You dismiss it. A larger study three years later confirms the effect. Those three years of delay — and the patients who died during them — were the cost of treating a threshold as a truth.

3. You can't read a meta-analysis critically

Meta-analyses pool p-values and effect sizes from multiple studies. If you don't understand what individual p-values mean (and don't mean), you can't evaluate whether the pooled result is meaningful. Garbage p-values in → garbage meta-analysis out.

4. You fall for subgroup p-hacking

A trial fails its primary endpoint (p = 0.15). The authors run 20 subgroup analyses and find one with p = 0.01. They conclude: "The drug works in patients over 65." You believe them. Expected false positives in 20 subgroups at α = 0.05: 1. That "significant" subgroup might be the false positive.

5. You can't evaluate pharmaceutical marketing

Drug reps present p-values without effect sizes, confidence intervals, or clinical context. "p < 0.001!" sounds impressive until you learn the effect size was 0.2 on a 50-point scale. Without understanding what p-values do and don't measure, you're defenceless against statistical marketing.

6. You can't communicate risk to patients

Patient: "Doctor, does this treatment work?" You: "The study showed p = 0.03." Patient: "What does that mean for me?" You: "..."

The p-value tells you about the STUDY. The patient wants to know about THEMSELVES. You need the effect size, the NNT (number needed to treat), the absolute risk reduction — none of which the p-value provides. A doctor who can only cite p-values cannot do patient-centred communication.


What Should You Report INSTEAD of (or Alongside) the p-value?

| Measure | What It Tells You | Why It's Better |
|---|---|---|
| Effect size (mean difference, Cohen's d, risk ratio) | HOW MUCH the groups differ | Tells you magnitude, not just existence |
| 95% Confidence Interval | The range of plausible effect sizes | Shows precision AND magnitude in one number |
| Absolute Risk Reduction (ARR) | Actual percentage point reduction in risk | Clinically interpretable |
| Number Needed to Treat (NNT) | How many patients you need to treat to benefit one | Directly actionable for clinical decisions |
| Bayes Factor | Ratio of evidence for H1 vs H0 | Directly answers "how much should I update my belief?" |

The ideal reporting: "Drug X reduced mortality by 4.2 percentage points (95% CI: 1.1 to 7.3, p = 0.008, NNT = 24)."

This single sentence tells you:

  • The effect is real (p = 0.008)
  • The effect is plausibly between 1.1 and 7.3 percentage points (CI)
  • You need to treat 24 patients to save one life (NNT)
  • Whether 24 patients is worth it depends on the drug's cost and side effects (clinical judgment)

The p-value alone told you only the first bullet. The other three are what actually matter for clinical decisions.
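None of those numbers is hard to produce. Here is a sketch that computes the ARR, its 95% CI, and the NNT from raw trial counts; the counts are assumed and chosen only to land near the figures in the example sentence:

```python
from math import sqrt

deaths_control, n_control = 208, 1_200    # assumed control arm: 17.3% mortality
deaths_drug,    n_drug    = 158, 1_200    # assumed drug arm:    13.2% mortality

risk_control = deaths_control / n_control
risk_drug    = deaths_drug / n_drug

arr = risk_control - risk_drug                                  # absolute risk reduction
se  = sqrt(risk_control * (1 - risk_control) / n_control +
           risk_drug * (1 - risk_drug) / n_drug)                # SE of a risk difference
ci_low, ci_high = arr - 1.96 * se, arr + 1.96 * se
nnt = 1 / arr                                                   # number needed to treat

print(f"ARR = {arr:.1%} (95% CI {ci_low:.1%} to {ci_high:.1%}), NNT = {nnt:.0f}")
# ARR = 4.2% (95% CI 1.3% to 7.0%), NNT = 24
```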


The One Thing to Remember

The p-value is a measure of surprise under a specific assumption. It asks: "If nothing is happening, how weird is my data?"

That's it. That's all it does.

It doesn't tell you the probability your hypothesis is true. It doesn't tell you the effect is large. It doesn't tell you the effect is clinically important. It doesn't tell you what to do with your patient.

The p-value is the answer to a question nobody actually wants to ask. What everyone wants to know is: "Does this drug work?" What the p-value answers is: "If the drug doesn't work, how surprised should I be by this data?" Those are profoundly different questions.

Fisher gave us the p-value as a tool for thinking. We turned it into a substitute for thinking. He wanted scientists to use judgment. We wanted a number to replace judgment. The p-value obliged — and medicine has been both helped and harmed by it ever since.

The resident who sees p = 0.03 and asks "What's the effect size? What's the confidence interval? What's the NNT? Was the primary analysis pre-specified? How many comparisons were run?" — that resident understands the p-value.

The resident who sees p = 0.03 and says "Significant" — that resident has been replaced by a calculator. And the calculator doesn't treat patients.