Stateazy · 27 min read · 12 April 2026

Your Study Found "No Difference." Are You Sure?

Statistical power — why underpowered studies are the silent killer of good research, and how to calculate sample size properly.

Stateazy Series

Why Is Your Study's Ability to Find the Truth Called "Power" — And Why Is It 1-β, Not 1-α?

The Problem First

You're a nephrology resident. You've just defended your thesis: a comparison of two dialysis regimens on serum potassium control. 30 patients per group. Mean difference: 0.4 mEq/L. p = 0.18.

Your examiner says: "Not significant. Your study failed."

You feel crushed. Two years of work. Hundreds of hours. And the answer is "nothing happened."

But your statistician friend looks at the data and says: "Your study didn't fail. It was never designed to succeed."

She runs a quick calculation: With n=30 per group, SD=1.2, and a true difference of 0.4 mEq/L, your study had power of roughly 25%.

That means: even if the difference were REAL, your study had only about a one-in-four chance of detecting it. You had roughly a 75% chance of "finding nothing" even if something was genuinely there.

You didn't test the hypothesis. You performed a ritual with a predetermined outcome. Your study was like trying to photograph a leopard in the dark with a phone camera. The leopard might be there. Your equipment just can't see it.

That ability to see — the probability that your study will detect a real effect when one truly exists — is called power. And most resident theses don't have enough of it.
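If you want to reproduce the statistician's arithmetic, here is a minimal sketch in Python using a normal approximation (scipy's norm). The inputs are the thesis numbers above; the helper name power_two_means is mine, not from any particular package, and an exact t-based calculation gives almost the same answer.

```python
from scipy.stats import norm

def power_two_means(delta, sd, n_per_group, alpha=0.05):
    """Approximate power for comparing two independent means (two-sided test)."""
    se = sd * (2 / n_per_group) ** 0.5        # standard error of the difference in means
    z_crit = norm.ppf(1 - alpha / 2)          # 1.96 for a two-sided alpha of 0.05
    ncp = delta / se                          # how many standard errors the true difference spans
    # probability the observed difference lands beyond the critical value in either tail
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

print(power_two_means(delta=0.4, sd=1.2, n_per_group=30))   # about 0.25
```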


Word Surgery: "Power"

"Power"

Root: Latin potis (able) + esse (to be) = "to be able" → Late Latin potere → Old French povoir → Middle English power

Literal meaning: "the ability to do something" / "capability"

In statistics: The ability of a study to detect a real effect. The probability of correctly rejecting H0 when H1 is actually true.

Why "power" and not "sensitivity" or "detection ability" or "capacity"?

Because Jerzy Neyman and Egon Pearson (1933), who invented the concept, were thinking in terms of a test's strength against a specific alternative. They conceptualised the statistical test as a weapon against falsehood. A "powerful" test is one that reliably destroys a false null hypothesis. A "weak" test lets false nulls survive.

The metaphor is militaristic: the test has POWER to defeat H0 when H0 deserves to be defeated. A powerful army captures the enemy fort. A powerful test captures the truth.

Aha: "Power" = "the probability that your study is ABLE to find the truth." Low power = your study is weak, like trying to hear a whisper in a storm. High power = your study is strong, like hearing a shout in silence.

The Symbol: 1 - β

And here's where every student gets confused.


Why 1 - β and NOT 1 - α? — The Question Nobody Answers Clearly

This is the single most confusing aspect of power. Let's untangle it from first principles.

First: What Are α and β?

Neyman and Pearson (1933) defined two types of errors, named with the first two Greek letters:

                                              | H0 is actually TRUE (drug doesn't work) | H0 is actually FALSE (drug works)
Study says "works" (Reject H0)                | Type I Error (α) — false alarm          | Correct! (True Positive)
Study says "doesn't work" (Fail to reject H0) | Correct! (True Negative)                | Type II Error (β) — missed detection

α = probability of saying "it works" when it doesn't (false positive)
β = probability of saying "it doesn't work" when it does (false negative)

Now: What Is Power?

Power is the probability of the correct cell in the H0-is-FALSE column: the probability of correctly saying "it works" when it truly works.

Looking at the column where H0 is FALSE (drug truly works):

  • Probability of MISSING it (saying "doesn't work") = β
  • Probability of FINDING it (saying "works") = 1 - β

Power = 1 - β = the complement of the miss rate.

Why Not 1 - α?

Let's see what 1 - α would mean.

Looking at the column where H0 is TRUE (drug truly doesn't work):

  • Probability of FALSE ALARM (saying "works" when it doesn't) = α
  • Probability of CORRECT SILENCE (saying "doesn't work" when it doesn't) = 1 - α

1 - α = the probability of correctly staying silent when nothing is happening. That's called the specificity of the test, or the confidence level (for a 95% CI, 1 - α = 0.95).

The Key Distinction

Measure | What It Answers                                                 | Which Column of the Table
α       | "How often do I cry wolf?"                                      | H0 TRUE column — false alarm rate
1 - α   | "How often do I correctly stay quiet when nothing's happening?" | H0 TRUE column — correct silence
β       | "How often do I miss the wolf?"                                 | H0 FALSE column — miss rate
1 - β   | "How often do I catch the wolf?"                                | H0 FALSE column — detection rate = POWER

α and 1-α live in the "nothing is happening" world. They describe your behaviour when H0 is true.

β and 1-β live in the "something IS happening" world. They describe your behaviour when H1 is true.

Power is about DETECTION, not about FALSE ALARMS. It asks: "When there IS a real effect, how likely are you to find it?" That's the 1-β column, not the 1-α column.

Aha: Think of a smoke detector.

  • α = how often it beeps when there's no fire (false alarm rate). You want this LOW.
  • 1-α = how often it stays silent when there's no fire. You want this HIGH.
  • β = how often it FAILS to beep when there IS a fire (miss rate). You want this LOW.
  • 1-β = Power = how often it beeps when there IS a fire (detection rate). You want this HIGH.

You'd never describe a smoke detector's fire-detecting ability as "1 minus the false alarm rate." That would tell you about its behaviour when there's no fire — irrelevant to detection. You want to know: WHEN THERE'S A FIRE, does it go off? That's 1-β. That's power.
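If the smoke-detector framing still feels abstract, a quick simulation makes it concrete: generate data from a world where the effect is real, run the test many times, and count how often the alarm goes off. This is only an illustrative sketch; the parameters are the thesis example from the opening (true difference 0.4, SD 1.2, n = 30 per group).

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
true_diff, sd, n, sims = 0.4, 1.2, 30, 20_000

detections = 0
for _ in range(sims):
    control = rng.normal(0.0, sd, n)
    treated = rng.normal(true_diff, sd, n)         # the effect genuinely exists here
    if ttest_ind(control, treated).pvalue < 0.05:  # did the study detect it?
        detections += 1

print(detections / sims)   # empirical power: roughly 0.24, i.e. the detector misses about 3 fires in 4
```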

Why Students Confuse Them

  1. α and β are both "error" probabilities. Students think: "Error is bad. 1 minus error is good. Power is good. So power = 1 minus error. Which error? α is the famous one. So power = 1-α." Wrong. Power = 1 minus the OTHER error.
  2. α is set at 0.05. So 1-α = 0.95. Power is typically 0.80. Students think: "0.95 sounds like power. 0.80 doesn't." The 0.95 is the confidence level, not the power. They're from different columns of the 2×2 table.
  3. The words don't help. "Significance level" (α) sounds related to "power." They're actually from opposite sides of the hypothesis table. α governs what happens when H0 is true. Power governs what happens when H0 is false. They're not on the same axis.

The History — Who Defined Power and Why

Neyman and Pearson (1928-1933) — The Inventors

Jerzy Neyman and Egon Pearson developed the concept of power as part of their formal hypothesis testing framework, published in a series of papers from 1928 to 1933, culminating in their 1933 paper in the Philosophical Transactions of the Royal Society.

Their key insight: Fisher's framework had no way to evaluate a test's ability to detect true effects. Fisher only cared about controlling false positives (α). He never asked: "If the drug works, will my test find it?"

Neyman and Pearson asked exactly that question. They introduced:

  1. The alternative hypothesis (H1) — for the first time, a formal statement of what "the drug works" means mathematically
  2. Type II error (β) — the probability of missing a real effect
  3. Power (1-β) — the probability of detecting a real effect
  4. The power function — power as a function of effect size: power increases as the true effect gets larger

Why Fisher Hated Power

Fisher refused to accept the concept of power. His objections:

  1. "You don't need H1." Fisher believed you should only specify H0 and measure evidence against it. Specifying H1 presumes you know the alternative, which you don't.
  1. "Power is unknowable." To calculate power, you need to know the TRUE effect size — which you don't know before the experiment. Neyman and Pearson's response: you don't need to know the exact effect size. You specify the MINIMUM clinically important difference, and calculate power for that.
  1. "Science isn't decision-making." Fisher saw hypothesis testing as measuring evidence. Neyman and Pearson saw it as making decisions. Power is a decision-theoretic concept — it optimises the decision procedure. Fisher didn't want science reduced to decisions.

The irony: Fisher rejected power, but every sample size calculation in every clinical trial uses power. The concept he dismissed became the foundation of trial design. You cannot submit an IND to FDA without a sample size justification based on power. Fisher lost this battle completely.

Jacob Cohen (1962, 1988) — The Evangelist

Jacob Cohen, an American psychologist, did more to popularise power analysis than anyone since Neyman and Pearson.

In 1962, Cohen published a devastating paper: he reviewed every study in a major psychology journal and found that the median power to detect a "medium" effect was 48%. More than half of published studies were underpowered — they had less than a coin-flip's chance of finding a real medium-sized effect.

His 1988 book Statistical Power Analysis for the Behavioral Sciences became the bible of power calculations. Cohen introduced standardised effect sizes (Cohen's d, Cohen's f) to make power calculations accessible without domain-specific knowledge of "what's a meaningful difference."

Word Surgery: "Effect Size"

Root: Effect (Latin effectus = "a carrying out, an accomplishment," from ex- out + facere to make) + Size (Old French sise = an amount, a measure)

"How much accomplishment" — the magnitude of what the treatment actually does.

Why a separate term? Because the p-value conflates effect size with sample size. A tiny effect with a huge sample gives a small p-value. A large effect with a tiny sample gives a large p-value. "Effect size" isolates the magnitude from the sample size. It asks: "How BIG is the difference?" not "How CERTAIN are we?"

Cohen's standardised effect sizes:

Measure    | Small | Medium | Large | What It Compares
Cohen's d  | 0.2   | 0.5    | 0.8   | Two means (difference in SDs)
Cohen's f  | 0.10  | 0.25   | 0.40  | Multiple means (ANOVA)
Cohen's w  | 0.10  | 0.30   | 0.50  | Proportions (chi-squared)
r          | 0.10  | 0.30   | 0.50  | Correlation
Odds ratio | 1.5   | 2.5    | 4.0   | Binary outcomes

Cohen's own caveat (often ignored): "The terms 'small,' 'medium,' and 'large' are relative, not only to each other, but to the area of behavioral science or even more particularly to the specific content and research method being employed in any given investigation." He meant these as rough guides, not universal standards. Researchers treat them as gospel.
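Cohen's d itself is nothing exotic: the difference in means divided by the pooled SD. A minimal sketch (the two arrays are invented numbers, purely for illustration):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: difference in means in units of the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

drug    = np.array([7.1, 6.8, 7.4, 6.5, 7.0, 6.9])   # invented HbA1c values
placebo = np.array([7.6, 7.3, 7.9, 7.5, 7.2, 7.8])
print(cohens_d(drug, placebo))   # negative d: the drug group is lower
```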


The Four Ingredients of Power

Power isn't a single number you choose. It's a consequence of four things:

Power = f(α, n, effect size, variability)

The Four Taps Analogy

Imagine power as the height of water in a bucket. Four taps control it:

Tap                    | Turn it UP → Power...             | Turn it DOWN → Power...
α (significance level) | α from 0.05 → 0.10 → Power ↑      | α from 0.05 → 0.01 → Power ↓
n (sample size)        | More patients → Power ↑           | Fewer patients → Power ↓
Effect size (δ)        | Larger true difference → Power ↑  | Smaller true difference → Power ↓
Variability (σ)        | Less noise (smaller SD) → Power ↑ | More noise (larger SD) → Power ↓

The relationships:

  • α and power move together. Loosening your false alarm threshold (larger α) also makes it easier to detect real effects. This is the α-β tradeoff: reducing one error increases the other (at fixed n).
  • n increases power. This is the most common lever. Can't change the effect size (biology determines it). Can't change σ easily (patient variability is what it is). CAN recruit more patients.
  • Effect size and power move together. Easier to detect a 20 mmHg BP drop than a 2 mmHg drop. Big effects are loud. Small effects whisper.
  • Variability and power move OPPOSITELY. High SD = lots of noise = signal hidden = low power. Low SD = quiet background = signal audible = high power.
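A short sketch of the four taps in action. The baseline numbers (difference 5, SD 12, n = 40 per group) are arbitrary illustrations, and the power function is the same normal approximation sketched earlier, not a library routine.

```python
from scipy.stats import norm

def power(delta, sd, n, alpha=0.05):
    """Two-sided power for comparing two means, normal approximation."""
    se = sd * (2 / n) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta / se - z_crit) + norm.cdf(-delta / se - z_crit)

base = dict(delta=5, sd=12, n=40)
print(power(**base))                       # baseline: ~0.46
print(power(**base, alpha=0.10))           # looser alpha         -> power rises (~0.59)
print(power(**{**base, "n": 100}))         # more patients        -> power rises (~0.84)
print(power(**{**base, "delta": 10}))      # bigger true effect   -> power rises (~0.96)
print(power(**{**base, "sd": 20}))         # noisier measurements -> power falls (~0.20)
```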

The Sample Size Formula — Where Power Lives

For comparing two means:

n per group = (Zα + Zβ)² × 2σ² / δ²

Where:

  • Zα = z-value for the significance level (1.96 for α=0.05 two-sided)
  • Zβ = z-value for the desired power (0.84 for 80% power, 1.28 for 90%)
  • σ² = variance (estimated from pilot data or literature)
  • δ = minimum clinically important difference

Note on Zβ: it is the z-value that cuts off an upper-tail area of β, which is the same thing as the (1-β)th percentile of the standard normal. At 80% power: β = 0.20, so Zβ = Z₀.₂₀ = 0.84 (the 80th percentile). At 90% power: β = 0.10, so Zβ = Z₀.₁₀ = 1.28 (the 90th percentile).
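The same formula as a short, hedged sketch in Python. The helper name n_per_group is mine; the numbers plugged in (a 5 mmHg difference, SD 12) mirror the thesis example discussed later in this piece.

```python
import math
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Sample size per group for comparing two means, two-sided test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05, two-sided
    z_beta = norm.ppf(power)            # 0.84 for 80% power, 1.28 for 90%
    n = (z_alpha + z_beta) ** 2 * 2 * sd ** 2 / delta ** 2
    return math.ceil(n)                 # always round UP: you can't recruit 0.4 of a patient

print(n_per_group(delta=5, sd=12, power=0.80))   # about 91 per group
print(n_per_group(delta=5, sd=12, power=0.90))   # about 122 per group (roughly a third more)
```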

Why 80% and not 95%? Because, at a fixed sample size, α and β sit at opposite ends of a seesaw: pushing β down to 0.05 (95% power) while holding α at 0.05 demands a massive sample. The convention of 80% power (β = 0.20) represents the pragmatic compromise:

We accept a 5% false positive rate (α) but tolerate a 20% false negative rate (β).

This asymmetry — 5% vs 20% — reflects the moral judgment that approving a harmful drug (Type I) is worse than failing to approve a helpful drug (Type II). The system is biased toward caution. Power of 80% means we ACCEPT that 1 in 5 effective drugs will be missed. That's the price of keeping α at 5%.


Why Is 80% the Convention? — The Historical Bargain

Cohen's Argument (1988)

Cohen argued for 80% power based on a 4:1 ratio of β to α:

If α = 0.05, then β should be no more than 4 × α = 0.20. Power = 1 - β = 1 - 0.20 = 0.80.

His reasoning: a Type I error (false positive) is roughly 4 times as serious as a Type II error (false negative) in typical research. Therefore, we should be 4 times more protective against Type I than Type II.

This 4:1 ratio is Cohen's judgment, not a mathematical derivation. He openly said it was a "soft" recommendation. But it became carved in stone.

ICH E9's Position

ICH E9 (1998) does NOT mandate 80% power. It states:

"The number of subjects in a clinical trial should always be large enough to provide a reliable answer to the questions addressed... The sample size should be justified in terms of... a clinically relevant difference... the probability of erroneously failing to detect this difference should be adequately small (the trial should have adequate 'power')."

"Adequately small" β. Not "β = 0.20." Not "power = 80%." The guidance leaves it to the sponsor.

In practice: FDA expects ≥80% power for pivotal trials, and increasingly expects 90% power for confirmatory studies. A submission with 70% power will trigger questions. A submission with 60% power will trigger a Complete Response Letter.

The Modern Shift Toward 90%

Many recent FDA guidances and EMA recommendations specify 90% power for pivotal trials. The logic:

  • At 80% power, 1 in 5 trials of a truly effective drug will fail. That's a lot of failed trials at $200M each.
  • At 90% power, 1 in 10 will fail. Still expensive, but more acceptable.
  • The cost of increasing from 80% to 90% power (roughly 30% more patients) is often less than the cost of a failed Phase 3 trial.
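The arithmetic behind that "roughly 30% more": with α, σ, and δ held fixed, the required n scales with (Zα + Zβ)², so moving from 80% to 90% power multiplies n by (1.96 + 1.28)² / (1.96 + 0.84)² ≈ 10.5 / 7.84 ≈ 1.34, about a third more patients per group.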

The Regulatory Dimension

FDA and Power — Where It's Non-Negotiable

1. IND Application — Sample Size Justification

Every Investigational New Drug (IND) application must include a sample size justification. FDA reviews:

  • What effect size was assumed? Must be clinically meaningful and justified by Phase 2 data or literature.
  • What variance was assumed? Must be justified by pilot data, Phase 2, or published studies.
  • What power was targeted? Must be ≥80% (preferably 90%).
  • What α was used? Must be 0.05 (two-sided) unless otherwise justified.
  • Are these assumptions reasonable? FDA has seen thousands of trials. They know when assumptions are optimistic.

Real example: A sponsor assumes SD = 8 for HbA1c change based on a Phase 2 study of 60 patients. FDA notes that Phase 3 populations are typically more heterogeneous than Phase 2, and the observed SD in other diabetes trials is 10-12. FDA asks the sponsor to recalculate sample size with SD = 11. The required n jumps from 180 to 340 per group. The sponsor's timeline and budget explode.

The variance assumption in the power calculation is the most consequential number in drug development. Get it wrong by 20% and your trial is either underpowered (misses the effect) or oversized (wastes years and money).

2. Adaptive Sample Size Re-estimation

Because the variance assumption is often wrong, FDA allows adaptive designs where:

  1. Enrol initial patients (e.g., 50% of planned n)
  2. Conduct a blinded interim analysis to re-estimate the variance
  3. Increase sample size if variance is higher than assumed (to maintain power)
  4. Do NOT decrease sample size (to avoid gaming)

This is formalised in FDA's 2019 Guidance on Adaptive Designs. The goal: maintain the TARGET POWER despite uncertain variance estimates.

The interim analysis must be:

  • Blinded (don't reveal which group is treatment vs placebo)
  • Pre-specified in the protocol
  • Conducted by an independent statistical team or DSMB
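A schematic sketch of what blinded re-estimation can look like in code. The numbers echo the HbA1c example above (design SD of 8, observed SD nearer 11), the interim data are simulated, and the helper names are mine; a real trial would do this under a pre-specified charter with an independent statistical team, not in a notebook.

```python
import math
import numpy as np
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z ** 2 * 2 * sd ** 2 / delta ** 2)

design_sd, mcid = 8.0, 4.0
n_planned = n_per_group(mcid, design_sd)            # original target per group

# Blinded interim look: pool all outcomes WITHOUT revealing treatment assignment
rng = np.random.default_rng(0)
interim = rng.normal(0.0, 11.0, size=170)           # simulated pooled interim outcomes
blinded_sd = interim.std(ddof=1)                    # pooled SD (slightly inflated by any true effect)

n_revised = n_per_group(mcid, blinded_sd)
n_final = max(n_planned, n_revised)                 # adjust upward only, never downward
print(n_planned, n_revised, n_final)
```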

3. Failed Trials — When Power Was Insufficient

When a pivotal trial fails (p > 0.05), FDA asks: "Was the trial adequately powered?"

If the observed effect is close to what was assumed, but the variance was higher → the trial was underpowered → failure was due to insufficient sample size, not drug inefficacy → FDA may allow a second, larger trial.

If the observed effect is much smaller than assumed → the drug effect was overestimated → the power calculation was based on an optimistic assumption → the drug might genuinely not work well enough.

Real example — Alzheimer's: Multiple Phase 3 trials for amyloid-targeting drugs failed despite "adequate" power calculations. The problem: the assumed effect sizes (from Phase 2) were overly optimistic. Phase 2 patients were less heterogeneous, more carefully selected, and showed larger effects than Phase 3 populations. Power calculations based on Phase 2 effect sizes systematically overestimated power for Phase 3.

4. Non-Inferiority Trials — Power Gets Harder

In a superiority trial, you power for an expected DIFFERENCE. In a non-inferiority trial, you power for a MARGIN — the maximum acceptable amount by which the new drug can be worse.

Non-inferiority margins are typically small (because you need to prove the drug isn't "much worse"). Small margins → you need to detect small effects → you need MORE power → you need LARGER samples.

A non-inferiority trial typically requires 2-4× the sample size of a comparable superiority trial. This is why non-inferiority trials are so expensive and why power calculations for them are scrutinised intensely.
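A rough feel for why, using the same n ∝ 1/δ² scaling from the sample size formula: if a superiority trial is powered to detect a 10% difference but the non-inferiority margin is set at 5%, then (all else equal, and ignoring the switch to a one-sided α) the required sample size grows by a factor of (10/5)² = 4.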

5. Post-Hoc Power — The Statistical Sin

Word Surgery: "Post-Hoc"

Root: Latin post (after) + hoc (this) → "after this"

"Post-hoc power" = "power calculated AFTER seeing the results"

This is widely considered statistically invalid. Post-hoc power is a mathematical transformation of the p-value — it contains no information that the p-value doesn't already provide. If p is large, post-hoc power is low. If p is small, post-hoc power is high. It's circular.
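You can see the circularity directly. Under a normal approximation, "observed power" is a deterministic function of the observed p-value and nothing else; the sketch below (my own helper, not a standard library routine) maps p straight to post-hoc power.

```python
from scipy.stats import norm

def post_hoc_power(p, alpha=0.05):
    """'Observed power' reconstructed from a two-sided p-value alone (normal approximation)."""
    z_obs = norm.ppf(1 - p / 2)          # the |z| statistic implied by the p-value
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

for p in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(p, round(post_hoc_power(p), 2))
# p = 0.05 always maps to ~50% "observed power"; a large p always maps to low "observed power".
```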

FDA statistical reviewers will flag post-hoc power calculations in submissions. If a sponsor says "the trial had 92% post-hoc power," the reviewer responds: "Post-hoc power is a function of the observed p-value and adds no information. Report the effect size, CI, and pre-specified power calculation instead."

The valid version: PROSPECTIVE power — calculated BEFORE the trial, based on ASSUMED (not observed) effect sizes. This is what drives sample size decisions and what regulators evaluate.


Branch-by-Branch — Where Power Bites You

General Medicine

The scenario: A ward conducts a "quality improvement study" comparing two antibiotic protocols for community-acquired pneumonia. n = 25 per group. They find "no significant difference in length of stay (p=0.34)."

They conclude: "Both protocols are equally effective."

The power reality: With n=25 per group and typical LOS variability (SD ≈ 4 days), the study had roughly 40% power to detect a 2-day difference. It had about a 60% chance of missing a real 2-day difference.

A 2-day difference in LOS across 10,000 pneumonia admissions per year = 20,000 bed-days = ₹10 crore in hospital costs. The study was too weak to detect an effect worth ₹10 crore annually. And the conclusion of "no difference" is being used to justify continuing with the more expensive protocol.

The fix: Before starting, calculate: "How many patients do I need to have 80% power to detect a 2-day difference?" Answer: ~64 per group. If you can't recruit 64, don't start the study — or don't claim equivalence from the result.
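The before-you-start calculation for this scenario, as a quick sketch (normal approximation, using my own helper functions rather than any named package):

```python
import math
from scipy.stats import norm

def power(delta, sd, n, alpha=0.05):
    se = sd * (2 / n) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta / se - z_crit) + norm.cdf(-delta / se - z_crit)

def n_per_group(delta, sd, alpha=0.05, target_power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(target_power)
    return math.ceil(z ** 2 * 2 * sd ** 2 / delta ** 2)

print(power(delta=2, sd=4, n=25))      # roughly 0.4: the study as actually run
print(n_per_group(delta=2, sd=4))      # about 63-64 per group needed for 80% power
```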


Surgery

The scenario: "No significant difference in recurrence rates between laparoscopic and open hernia repair (p=0.22)."

Recurrence: Laparoscopic 3%, Open 8%. Clinically meaningful 5% absolute difference.

n = 50 per group. Power to detect a 3% vs 8% difference in proportions with n=50? Approximately 12%.

The study had a 12% chance of detecting the real difference. An 88% chance of saying "no difference" even though laparoscopic repair is genuinely better.

And based on this "negative" study, a surgeon continues doing open repairs because "there's no evidence laparoscopic is better." There IS evidence. The study just didn't have the power to see it.


Paediatrics

The scenario: Nearly every paediatric drug trial has a power problem.

Ethical constraints → small recruitment pools → small n → low power → "failed" trials → drugs not approved for children → off-label prescribing.

A typical paediatric trial: n=30 per group. For a "medium" effect (Cohen's d=0.5), power ≈ 48%. Coin flip. You'd need n=64 per group for 80% power.

But recruiting 128 children for a single trial may take 3-5 years across multiple centres. By then, the standard of care has changed.

This is the paediatric power paradox: The population that most needs well-powered trials (children) is the hardest population to adequately power trials for.

Regulatory solution: FDA's Pediatric Research Equity Act (PREA) and EMA's Paediatric Regulation mandate paediatric studies but also allow:

  • Extrapolation from adult data (avoiding underpowered paediatric trials)
  • Bayesian borrowing (using adult trial data as an informative prior for the paediatric trial, effectively boosting power)
  • Adaptive enrichment (focusing recruitment on children most likely to respond)

These are all strategies to achieve ADEQUATE POWER despite small achievable sample sizes.


Obstetrics

The scenario: A trial of a new tocolytic to prevent preterm birth. The outcome is preterm delivery rate.

Baseline rate: 12%. Expected reduction with new drug: to 8%. Absolute reduction: 4%.

Power at n=200 per group: roughly 27%. Power at n=500 per group: roughly 56%. Power at n=800 per group: roughly 76%. For 90% power, you would need around 1,200 per group.

The sponsor budgets for n=200 per group. The trial "fails" (p=0.15). They claim "the drug doesn't prevent preterm birth."

The drug might well prevent preterm birth. The trial just didn't have the power to see a 4% absolute reduction. For an event rate difference of 12% vs 8%, you need HUGE samples. This is the curse of binary endpoints with baseline rates close to each other — detecting small absolute differences requires massive n.
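A back-of-the-envelope sketch of those numbers, using the standard normal approximation for two proportions (no continuity correction; the helper is mine):

```python
import math
from scipy.stats import norm

def power_two_props(p1, p2, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two proportions (unpooled variance)."""
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = abs(p1 - p2) / se
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

for n in (200, 500, 800, 1200):
    print(n, round(power_two_props(0.12, 0.08, n), 2))
# roughly: 200 -> 0.27, 500 -> 0.56, 800 -> 0.76, 1200 -> 0.91
```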

The clinical consequence: A potentially life-saving tocolytic is abandoned because the sponsor didn't budget for adequate power. Premature babies continue to be born who could have been prevented. The power calculation wasn't a statistical exercise — it was a life-and-death resource allocation decision.


Psychiatry

The scenario: Psychiatric drug effects are small. Cohen's d for antidepressants vs placebo is typically 0.3-0.5.

To achieve 80% power for d=0.3 at α=0.05 (two-sided), you need:

n = 176 per group.

Most academic psychiatric trials recruit 40-60 per group. Power at n=50 per group for d=0.3: roughly 30%.

The replication crisis in psychiatry is substantially a POWER crisis. Small trials find "significant" results by chance (inflated effect sizes in underpowered studies — the "winner's curse"). Replication attempts with realistic sample sizes find smaller effects and "fail to replicate."

The original studies weren't wrong about the direction of the effect. They were wrong about its magnitude — because underpowered studies that happen to reach significance systematically overestimate effect sizes (only the luckiest runs of data cross p < 0.05).

Word Surgery: "Winner's Curse"

Origin: From auction theory (economics). The "winner" of an auction for an item of unknown value tends to be the bidder who OVERESTIMATED the item's value. They "won" by overpaying.

In statistics: The study that "wins" (reaches significance in an underpowered field) tends to OVERESTIMATE the true effect. It "won" by being a lucky sample. Replication with adequate power reveals the true, smaller effect.

So the "winner's curse" in underpowered research means: the studies that get published (the "winners") are the ones that overestimated the effect. The accurate studies "lost" (non-significant) and were never published.


Community Medicine / PSM

The scenario: A cluster-randomised trial of a hand-washing intervention in 20 schools. Outcome: absence due to diarrhoeal illness.

The sample size calculation assumed:

  • ICC = 0.01 (low clustering)
  • Actual ICC = 0.08

With ICC = 0.01: Design effect = 1.3, effective n is adequate, power = 82%. With ICC = 0.08: Design effect = 3.4, effective n drops by 60%, power = 38%.

The trial was powered for a world with ICC = 0.01 but lived in a world with ICC = 0.08. The variance between clusters was 8× higher than assumed. The study was underpowered by a factor of 2.5.
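The design-effect arithmetic, as a small sketch. The cluster size of about 31 children per school is not stated in the scenario; it is back-calculated from the quoted design effect of 1.3 at ICC = 0.01 and should be read as an assumption.

```python
def design_effect(icc, cluster_size):
    """Kish design effect: how much clustering inflates the variance of the estimate."""
    return 1 + (cluster_size - 1) * icc

m = 31                    # assumed children per school (implied by DE = 1.3 at ICC = 0.01)
n_total = 20 * m          # 20 schools

for icc in (0.01, 0.08):
    de = design_effect(icc, m)
    print(icc, round(de, 1), round(n_total / de))   # design effect and effective sample size
# ICC 0.01 -> DE 1.3, effective n ~477; ICC 0.08 -> DE 3.4, effective n ~182
```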

The trial "failed." The hand-washing programme was defunded. Children continued to get diarrhoea.

The power calculation was correct FOR THE ASSUMED ICC. The assumed ICC was wrong. This is why ICH E9 and FDA guidance emphasise that power assumptions must be justified, and why adaptive designs that re-estimate variance (including ICC in cluster trials) are increasingly recommended.


Orthopaedics

The scenario: Comparing two hip implants on Harris Hip Score at 2 years.

Assumed: MCID = 10 points, SD = 15, α = 0.05, power = 80% → n = 36 per group.

Actual SD = 22. Power at n=36 with SD=22: approximately 44%.

The study reports: "No significant difference between implants (p=0.11)." The point estimate is an 8-point difference (close to the MCID of 10).

The implant that might be clinically better is dismissed because the variance was underestimated. The power calculation assumed SD = 15 (from a single-centre pilot). The multicentre trial had SD = 22 (more heterogeneous surgeons, patients, and post-operative protocols).

Lesson: Power calculations based on pilot data from a controlled single-centre environment UNDERESTIMATE the variance of a multicentre pivotal trial. Always inflate the variance assumption by 20-30% when moving from pilot to pivotal.


The 6 Ways Not Knowing Power Destroys You

1. You conclude "no effect" from an underpowered study

This is the most common and most damaging error. A study with 30% power and p = 0.15 tells you NOTHING about whether the treatment works. It tells you the study was too small to answer the question. Concluding "no effect" from low power is like concluding "no fish in the lake" after fishing for 5 minutes with a broken rod.

2. You can't justify your thesis sample size

Examiner: "How did you arrive at n = 40 per group?" Bad answer: "Previous studies used 40." Good answer: "Assuming a minimum clinically important difference of 5 mmHg in systolic BP, SD of 12 mmHg from previous literature, α = 0.05 two-sided, and 80% power, the required sample size is n = 92 per group using a two-sample t-test formula. However, we could only recruit 40 per group, giving us approximately 45% power. We acknowledge this limitation."

The second answer shows you understand power. The first shows you copied a number.

3. You fall for the "winner's curse" in published literature

Published studies in underpowered fields overestimate effect sizes. If you use those inflated effect sizes in YOUR power calculation, you'll underpower your own study. The cycle perpetuates: underpowered study → inflated effect → next study uses inflated estimate → still underpowered → still inflated. This is why meta-analyses are essential for realistic effect size estimation.

4. You don't understand why your study needs more patients than you thought

"But the effect is significant in the literature with n=30!"

Yes, in one lucky study. The published effect size is inflated. The true effect is smaller. You need more patients. Power analysis based on the TRUE (smaller) effect size gives a larger n. This is not a flaw of the method — it's the method correcting for publication bias.

5. You confuse post-hoc power with prospective power

Calculating power AFTER your study has failed is circular and meaningless. "Our post-hoc power was 45%" is just another way of saying "our p-value was large." Prospective power (before data collection) drives design decisions. Post-hoc power drives nothing.

6. You can't evaluate negative trials in the literature

When a meta-analysis includes "negative" trials, understanding their power tells you whether they were truly negative (adequate power, real absence of effect) or uninformative (inadequate power, couldn't detect the effect). A meta-analysis that gives equal weight to adequately powered and underpowered trials is combining signal with noise.


The One Thing to Remember

Power is your study's ability to see the truth. It's the probability that if the drug really works, your study will detect it.

A study without adequate power is not a study. It's a coin flip dressed in a lab coat. And "failing to reject H0" in an underpowered study tells you nothing about H0 — it only tells you that your study was too small to answer the question.

Power = 1-β because it's the COMPLEMENT of the miss rate. Not 1-α, which is the complement of the false alarm rate. These are different questions answered in different columns of the 2×2 truth table.

α asks: "When nothing is happening, how often do I cry wolf?" Its complement, 1-α, is the confidence level. β asks: "When something IS happening, how often do I miss it?" Its complement, 1-β, is the power.

The resident who understands power designs theses that can actually answer their questions, reads "negative" trials with appropriate scepticism, and never confuses "not significant" with "not effective."

The resident who doesn't understand power produces studies with 25% power, gets non-significant results, writes "no difference was found," and another potentially useful treatment disappears into the graveyard of underpowered trials — where effective interventions go to die in silence.

Cohen estimated that the median power in published research was 48% in 1962. Surveys in the 2010s found... it was still under 50% in many fields. Sixty years of power analysis evangelism, and half of published research is still a coin flip.

The tool exists. The formulas exist. The software exists. The only thing missing is the understanding of why it matters.

Now you have that understanding. Use it.