Why Can Nobody Remember the Difference Between Odds Ratio and Relative Risk?
The Problem First
You're a community medicine resident. Viva. The examiner places a 2×2 table in front of you.
| | Disease (+) | Disease (−) |
|---|---|---|
| Exposed | 30 | 70 |
| Not exposed | 10 | 90 |
"Calculate the relative risk and the odds ratio."
You calculate both. You get the right numbers. RR = 3.0. OR = 3.86. You feel good.
Then the examiner asks: "Why are they different? When would you use one vs the other? And why can't you calculate relative risk from a case-control study?"
Silence.
You've calculated these fifty times. You've memorised the formulas. You've drawn the 2×2 table in your sleep. But you can't explain WHY they're different, WHY one is bigger, or WHY the study design forces your choice. The formulas live in your fingers. The understanding never made it to your brain.
This is the single most reliably failed concept in biostatistics vivas across every medical specialty. Not because it's hard. Because the teaching is backwards — you're given two formulas and told to memorise them, without anyone explaining what "odds" actually MEANS and why it's fundamentally different from "risk."
The confusion isn't about the math. It's about the words.
Before the Formulas — What Are We Actually Measuring?
You want to know: does smoking cause lung cancer?
Two ways to express "how much more likely":
Way 1 — Risk thinking: "Smokers are 3 times more likely to get lung cancer than non-smokers."
Way 2 — Odds thinking: "The odds of lung cancer are 3.86 times higher in smokers than non-smokers."
Both sentences describe the SAME data. Both say smoking is bad. But they use different UNITS of measurement — like saying a room is "20 feet long" vs "6.1 metres long." Same room. Different rulers.
The problem: nobody uses "odds" in daily life except gamblers. Doctors think in risk. Patients think in risk. "What's my CHANCE of getting cancer?" is risk-thinking. Nobody walks into a clinic and asks "What are the ODDS of getting cancer?" — and even when someone does say the word "odds" casually, they mean "chance" (which is risk).
So why does the odds ratio exist at all? Why not just use relative risk for everything?
Because of one brutal mathematical constraint: some study designs physically cannot calculate risk.
Word Surgery: Risk
"Risk"
Root: Italian risco/risico (16th century) → possibly from Arabic rizq (provisions, fortune) or Greek rhiza (literally "root", later also "cliff" — the hazard sailors steered around)
Literal meaning: "danger" / "the chance of harm"
In statistics: The PROBABILITY of an event occurring.
→ Risk = Number who got the event / Total number in the group
→ Risk is a proportion. It lives between 0 and 1 (or 0% and 100%).
Why doctors think in risk: Because risk IS probability. "Your risk of stroke is 8%" means "8 out of 100 people like you will have a stroke." It's directly interpretable. Patients understand it. Clinical decisions are based on it.
"Relative Risk" (Risk Ratio)
Root: "Relative" from Latin relativus = "having reference to" / "in relation to"
Literal meaning: "risk IN RELATION TO another risk" → the RATIO of two probabilities
→ RR = Risk in exposed / Risk in unexposed
→ "How many times more likely is the event in the exposed group compared to the unexposed group?"
→ Aha: "Relative" risk = one risk RELATIVE to another. You're dividing one probability by another. The result tells you: 3× more likely, 0.5× less likely, etc.
Word Surgery: Odds
"Odds"
Root: English oddes (16th century) = "unequal things" / "the difference between two unequal quantities"
→ From Old Norse oddi = "point of a triangle" / "the odd one out" / "the unpaired thing"
Original meaning: In gambling, "odds of 3 to 1" meant for every 1 time you win, you lose 3 times. The word itself means "the unequal balance between winning and losing."
In statistics: Odds = Probability of event HAPPENING / Probability of event NOT happening
→ Odds = p / (1 − p)
This is the critical difference. Risk is a proportion. Odds is a RATIO of two complementary proportions.
| If risk is… | Then odds are… |
|---|---|
| 0.20 (20%) | 0.20 / 0.80 = 0.25 (or "1 to 4") |
| 0.50 (50%) | 0.50 / 0.50 = 1.0 (or "even odds") |
| 0.80 (80%) | 0.80 / 0.20 = 4.0 (or "4 to 1") |
| 0.01 (1%) | 0.01 / 0.99 = 0.0101 (≈ risk) |
→ Aha: When risk is LOW (< 10%), odds ≈ risk. When risk is HIGH, odds DIVERGES wildly from risk. At 50% risk, odds = 1. At 80% risk, odds = 4. The higher the risk, the more odds inflates the number compared to risk.
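The risk-to-odds relationship above can be made concrete with two tiny helper functions (an illustrative sketch — the function names are mine, not from any library):

```python
# Risk is a proportion (events / everyone); odds re-express the same
# probability as events / non-events. Two converters, p <-> p/(1-p).

def risk_to_odds(p):
    """Convert a probability (risk) to odds."""
    return p / (1 - p)

def odds_to_risk(odds):
    """Convert odds back to a probability (risk)."""
    return odds / (1 + odds)

for p in (0.01, 0.20, 0.50, 0.80):
    print(f"risk {p:.2f} -> odds {risk_to_odds(p):.4f}")
# risk 0.20 -> odds 0.2500, risk 0.50 -> odds 1.0000, risk 0.80 -> odds 4.0000
```

Running it reproduces the table above — and makes the divergence visible: at 1% risk the two numbers are near-identical, at 80% risk the odds are five times the risk.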
"Odds Ratio"
Literal meaning: "The ratio of two odds" → one group's odds divided by another group's odds
→ OR = Odds in exposed / Odds in unexposed
→ "How many times higher are the odds in the exposed group compared to the unexposed group?"
Why "Odds" Is Confusing
In everyday English, people use "odds" and "chance" and "risk" and "probability" and "likelihood" interchangeably.
- "What are the odds of rain?" → They mean probability.
- "The odds are against us" → They mean probability is low.
- "Odds of 5 to 1" → Only gamblers use this correctly.
In statistics, "odds" has a PRECISE meaning that is DIFFERENT from probability. But because the everyday meaning conflates them, students hear "odds ratio" and unconsciously process it as "risk ratio." They're not the same. The naming collision between everyday and statistical English is the PRIMARY source of confusion.
Naming Family
| Term | Formula | What It Measures | Range |
|---|---|---|---|
| Risk (probability) | Events / Total | Chance of event occurring | 0 to 1 |
| Odds | Events / Non-events | Balance between event and non-event | 0 to ∞ |
| Relative Risk (RR) | Risk₁ / Risk₂ | How many times more PROBABLE | 0 to ∞ (1 = no difference) |
| Odds Ratio (OR) | Odds₁ / Odds₂ | How many times higher the ODDS | 0 to ∞ (1 = no difference) |
| Absolute Risk Reduction (ARR) | Risk₁ − Risk₂ | Actual percentage point difference | −1 to 1 |
| Number Needed to Treat (NNT) | 1 / ARR | Patients to treat to prevent one event | 1 to ∞ |
| Hazard Ratio (HR) | Hazard₁ / Hazard₂ | Instantaneous rate ratio (survival analysis) | 0 to ∞ |
The Actual Math — Side by Side
Using the examiner's table:
| | Disease (+) | Disease (−) | Total |
|---|---|---|---|
| Exposed | a = 30 | b = 70 | 100 |
| Not exposed | c = 10 | d = 90 | 100 |
Relative Risk
Risk in exposed = a / (a+b) = 30/100 = 0.30
Risk in unexposed = c / (c+d) = 10/100 = 0.10
RR = 0.30 / 0.10 = 3.0
→ "Exposed people are 3 times more likely to get the disease."
Odds Ratio
Odds in exposed = a / b = 30/70 = 0.429
Odds in unexposed = c / d = 10/90 = 0.111
OR = 0.429 / 0.111 = 3.86
→ "The odds of disease are 3.86 times higher in the exposed group."
The Cross-Product Shortcut
OR = (a × d) / (b × c) = (30 × 90) / (70 × 10) = 2700/700 = 3.86
This is the ad/bc formula every student memorises. It gives the same answer as computing odds separately and dividing. It's mathematically identical — just faster.
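The whole calculation fits in a few lines. A sketch reproducing the examiner's table, confirming that the cross-product gives exactly the same number as dividing the two odds:

```python
# The examiner's 2x2 table, computed both ways: RR from row risks,
# OR from row odds, plus the ad/bc cross-product shortcut.

a, b = 30, 70   # exposed:   disease+, disease-
c, d = 10, 90   # unexposed: disease+, disease-

rr = (a / (a + b)) / (c / (c + d))   # 0.30 / 0.10 = 3.0
or_from_odds = (a / b) / (c / d)     # 0.429 / 0.111 ≈ 3.86
or_cross = (a * d) / (b * c)         # (30 * 90) / (70 * 10) ≈ 3.86

print(f"RR = {rr:.2f}, OR = {or_from_odds:.2f}, ad/bc = {or_cross:.2f}")
# RR = 3.00, OR = 3.86, ad/bc = 3.86
```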
Why Is OR Always Bigger Than RR?
When the outcome is common, OR exaggerates the association compared to RR. Always. Here's why:
RR = [a/(a+b)] / [c/(c+d)]
OR = [a/b] / [c/d] = (ad)/(bc)
The RR denominators include the diseased people (a+b, c+d). The OR denominators exclude them (b, d).
When few people have the disease (a is small relative to a+b), then a+b ≈ b, and RR ≈ OR.
When many people have the disease (a is large relative to a+b), then a+b is much bigger than b, and RR < OR. The OR "inflates" because its denominator is smaller.
Rule of thumb:
- Outcome prevalence < 10% → OR ≈ RR (safe to treat them as similar)
- Outcome prevalence 10-20% → OR noticeably > RR (be cautious)
- Outcome prevalence > 20% → OR substantially > RR (do NOT interpret OR as RR)
The Inflation Table
| Actual RR | Outcome prevalence 5% | Outcome prevalence 20% | Outcome prevalence 50% |
|---|---|---|---|
| 1.5 | OR ≈ 1.53 | OR ≈ 1.66 | OR ≈ 2.25 |
| 2.0 | OR ≈ 2.07 | OR ≈ 2.36 | OR ≈ 4.00 |
| 3.0 | OR ≈ 3.16 | OR ≈ 3.86 | OR ≈ 9.00 |
(Values assume two equal-sized groups whose risks average to the stated prevalence. Note the 20% column for RR = 3: it gives exactly the examiner's table — risks of 30% vs 10%, OR = 3.86.)
At 50% prevalence, a true RR of 3 appears as OR = 9. If you tell a patient "you're 9 times more likely to get this," you've tripled the fear compared to reality.
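The inflation pattern can be regenerated with a short sketch. It assumes two equal-sized groups whose risks average to the stated "outcome prevalence" (an assumption of this illustration — with other group splits the exact numbers shift, but the direction and scale of the inflation are the same):

```python
# Given a true RR and an overall outcome prevalence P (average of the
# two groups' risks, equal group sizes): p0 = 2P/(1+RR) is the risk in
# the unexposed, p1 = RR*p0 in the exposed; the OR is the ratio of odds.

def or_from_rr(rr, prevalence):
    p0 = 2 * prevalence / (1 + rr)   # risk in unexposed group
    p1 = rr * p0                     # risk in exposed group
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

for rr in (1.5, 2.0, 3.0):
    row = ", ".join(f"{or_from_rr(rr, p):.2f}" for p in (0.05, 0.20, 0.50))
    print(f"RR {rr}: OR at 5%/20%/50% prevalence = {row}")
# RR 3.0: OR at 5%/20%/50% prevalence = 3.16, 3.86, 9.00
```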
Why Can't You Calculate RR from a Case-Control Study?
This is the question that separates people who understand from people who memorise.
The Cohort Study (RR is possible)
You start with:
- 1000 smokers → follow → count who gets cancer
- 1000 non-smokers → follow → count who gets cancer
You KNOW the total in each group (1000). You can calculate:
- Risk in smokers = cancer cases / 1000
- Risk in non-smokers = cancer cases / 1000
- RR = one risk / the other ✓
You can calculate risk because you started with a DEFINED population and followed it forward. The denominator (total exposed, total unexposed) is real.
The Case-Control Study (RR is impossible)
You start with:
- 200 lung cancer patients (cases) → look back → how many smoked?
- 200 healthy controls → look back → how many smoked?
| | Cancer | No cancer |
|---|---|---|
| Smoker | 150 | 80 |
| Non-smoker | 50 | 120 |
Can you calculate risk in smokers? Risk = cancer cases among smokers / total smokers = 150 / (150 + 80) = 0.65 = 65%.
But this is MEANINGLESS. The "65%" is an artefact of how many cases vs controls you chose to recruit. If you'd recruited 400 cases and 200 controls, the number would be completely different. The ratio of cases to controls was decided by YOU (the researcher), not by nature.
In a cohort study, the proportion with disease reflects the REAL incidence. In a case-control study, the proportion with disease reflects your RECRUITMENT RATIO.
→ Risk requires a denominator that represents reality. Case-control studies don't have one. Therefore, risk cannot be calculated. Therefore, RR cannot be calculated.
→ BUT — odds ratio CAN be calculated. Here's the magical property:
The OR from a case-control study equals the OR you would have gotten from a cohort study of the same population. The OR is invariant to the sampling scheme. Whether you sample by exposure status (cohort) or by disease status (case-control), you get the same OR.
This is not intuition. This is a mathematical theorem. The cross-product (ad/bc) doesn't depend on the marginal totals — it depends only on the ASSOCIATION between exposure and disease. The recruitment ratio cancels out.
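The theorem can be watched in action. The sketch below starts from a cohort-style 2×2 table, then mimics case-control recruitment by keeping only a fraction of the disease-free column. The naive "RR" computed from the sample swings with the recruitment fraction; the cross-product OR never moves. (Exact expectations, no random sampling noise — the columns are scaled directly.)

```python
# Source population (the cohort truth): exposure rows, disease columns.
a, b = 30, 70    # exposed:   disease+, disease-
c, d = 10, 90    # unexposed: disease+, disease-

def measures(a, b, c, d):
    """Return (naive RR, cross-product OR) for a 2x2 table."""
    rr = (a / (a + b)) / (c / (c + d))
    or_ = (a * d) / (b * c)
    return rr, or_

# Case-control sampling: keep all cases, recruit a fraction f of the
# disease-free people. That multiplies the disease- column by f — and
# f cancels in ad/bc, but not in the risk denominators.
for f in (1.0, 0.5, 0.1):
    rr, or_ = measures(a, b * f, c, d * f)
    print(f"control fraction {f}: naive 'RR' = {rr:.2f}, OR = {or_:.2f}")
# control fraction 1.0: naive 'RR' = 3.00, OR = 3.86
# control fraction 0.1: naive 'RR' = 1.54, OR = 3.86
```

The design choice here — scaling one column rather than drawing a random sample — isolates the algebraic point: the recruitment ratio is a common factor of b and d, so it cancels from ad/bc.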
The One-Sentence Explanation
RR needs row totals (exposed total, unexposed total) to be meaningful. In case-control studies, YOU set the row totals by choosing how many cases and controls to recruit. So row totals are artificial. So risk is artificial. So RR is artificial. But the cross-product OR survives because it doesn't depend on the row totals.
Who Invented These? — A Tale of Two Traditions
Relative Risk — The Cohort Tradition
Origin: The concept of comparing disease rates between groups goes back to John Snow (1854), who compared cholera rates between people drinking from different water sources. He didn't call it "relative risk" — he just compared proportions.
The term "relative risk" was formalised in epidemiology in the mid-20th century as cohort studies became the standard design for studying disease causation. Jerome Cornfield (1951) was among the first to formally compare RR and OR.
The logic: Follow people forward in time → count who gets sick → compare rates. Natural. Intuitive. Expensive. Slow (you have to wait for the disease to develop).
Odds Ratio — The Case-Control Necessity
Origin: Jerome Cornfield (1951) proved mathematically that the OR from a case-control study approximates the RR when the disease is rare. This was the theoretical justification for the entire case-control design.
Why case-control studies exist: For rare diseases, cohort studies are impractical. To study a disease that affects 1 in 10,000 people, you'd need to follow 100,000 people for years to get enough cases. Instead, you start with 200 people who HAVE the disease, match them with 200 who DON'T, and look backwards. Faster. Cheaper. The tradeoff: you can only get the OR, not the RR.
Cornfield's proof (the rare disease assumption): When disease prevalence is low, a/(a+b) ≈ a/b and c/(c+d) ≈ c/d. Therefore RR ≈ OR. This is why OR "works" as an approximation of RR — but ONLY when the disease is rare.
The Logistic Regression Complication
Logistic regression outputs odds ratios, not relative risks. Always. Regardless of study design.
This means that even in a cohort study or RCT where you COULD calculate RR directly, if you use logistic regression for adjustment (which most studies do), you get OR. And if the outcome is common, the OR will exaggerate the association compared to the RR.
The fix: Use Poisson regression with robust variance estimation, or log-binomial regression, to get adjusted RR directly. But most researchers default to logistic regression because it's what they learned and what the software defaults to.
The consequence: The medical literature is flooded with ORs being reported and interpreted as if they were RRs, even in cohort studies and RCTs with common outcomes. This systematically inflates reported associations.
Why It's So Hard to Remember — The 5 Sources of Confusion
Source 1: The Words Sound Interchangeable
"Risk" and "odds" are synonyms in everyday English. You have to OVERRIDE your language instinct to keep them separate in statistics. Every time you relax your vigilance, your brain collapses them back into one concept.
Memory anchor: Risk = what happens / everyone. Odds = what happens / what DOESN'T happen. Risk has a TOTAL denominator. Odds has a COMPLEMENT denominator.
Source 2: Both Ratios Equal 1 When There's No Association
RR = 1 means no difference. OR = 1 means no difference. They BEHAVE the same way at the null value. They MOVE in the same direction (both > 1 means increased risk/odds). They LOOK similar in tables. The surface similarity masks the deep difference.
Source 3: For Rare Outcomes, They're Numerically Similar
When disease prevalence is < 10%, RR ≈ OR. So in many studies, the numbers are almost identical. Students see RR = 2.1 and OR = 2.2 and think "same thing." They ARE almost the same — for rare outcomes. The lesson SHOULD be: "they converge when disease is rare and diverge when disease is common." The lesson ACTUALLY learned: "they're basically the same."
Source 4: Textbooks Teach the Formula Before the Concept
Every textbook starts with the 2×2 table and the formulas: RR = [a/(a+b)] / [c/(c+d)], OR = ad/bc.
Students memorise the formulas. They can compute both. But they never internalise WHY odds exists as a separate concept, WHY case-control studies can't use risk, or WHY logistic regression defaults to OR.
The formula is the LAST thing you should learn. The concept is the first.
Source 5: The Rare Disease Assumption Creates False Security
"OR ≈ RR when disease is rare" is taught as a reassuring caveat. Students hear: "Don't worry about the difference, it only matters when disease is common."
But many important medical outcomes ARE common: mortality in ICU (20-40%), surgical complications (10-30%), treatment response (40-80%), preterm birth (10-15%), depression (15-20%). For ALL of these, OR significantly exaggerates the association compared to RR. The "rare disease assumption" doesn't apply to a huge proportion of clinical research.
The Decision Tree — Which One Do I Use?
```
What study design?
│
├── Case-control study
│   └── You can ONLY use OR
│       (RR is mathematically impossible)
│
├── Cross-sectional study
│   ├── Outcome prevalence < 10% → OR ≈ RR, either is fine
│   └── Outcome prevalence ≥ 10% → Use prevalence ratio, not OR
│
├── Cohort study
│   ├── Report RR (directly interpretable)
│   ├── If using logistic regression → it outputs OR
│   │   └── Consider Poisson/log-binomial regression for RR instead
│   └── If outcome rare → OR ≈ RR anyway
│
└── RCT
    ├── Report RR or ARR (clinically meaningful)
    ├── Report NNT (most useful for clinical decisions)
    └── If using logistic regression → report OR but INTERPRET cautiously
        └── If outcome common → OR will overestimate the effect
```
Branch-by-Branch — Where This Confusion Bites
General Medicine
The scenario: A cohort study reports: "Diabetes is associated with heart failure. OR = 2.5, p < 0.001."
Heart failure prevalence in the study: 25%.
The problem: With 25% prevalence, OR = 2.5 does NOT mean "diabetics are 2.5 times more likely to develop heart failure." The actual RR is approximately 1.9. The OR inflated the association by ~30%.
The clinical consequence: A doctor tells a diabetic patient: "You're two and a half times more likely to get heart failure." The truth: about 1.9 times. The doctor unintentionally amplified the patient's anxiety by 30% because the paper reported OR and the doctor read it as RR.
Surgery
The scenario: A systematic review of surgical site infection (SSI) after colorectal surgery. Pooled OR = 3.2 for emergency vs elective surgery.
SSI rate in colorectal surgery: ~15-25%.
The trap: The meta-analysis pooled ORs (because some included studies were case-control). The pooled OR of 3.2 would correspond to an RR of approximately 2.2-2.5. Every guideline that cites "3.2 times increased risk" based on this OR is overstating the association.
The fix that nobody does: Convert pooled OR to RR using the formula: RR = OR / [(1 − P0) + (P0 × OR)], where P0 is the baseline event rate. With P0 = 0.20 and OR = 3.2: RR = 3.2 / [(0.80) + (0.20 × 3.2)] = 3.2 / 1.44 = 2.22. Very different from 3.2.
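That conversion (the Zhang–Yu approximation) is a one-liner. The sketch below assumes you know P0, the baseline event rate in the reference group, from the study itself or an external source — and note it is an approximation for adjusted ORs, not an exact identity:

```python
# Zhang & Yu (1998): approximate RR from an OR and the baseline event
# rate p0 in the unexposed/reference group.

def or_to_rr(or_, p0):
    """RR ≈ OR / ((1 - p0) + p0 * OR)."""
    return or_ / ((1 - p0) + p0 * or_)

print(f"{or_to_rr(3.2, 0.20):.2f}")   # the SSI example: 2.22
```

As a sanity check, p0 → 0 recovers the rare-disease assumption: `or_to_rr(x, 0.0)` returns x, i.e. OR = RR.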
Obstetrics
The scenario: A case-control study of risk factors for pre-eclampsia. OR for previous pre-eclampsia = 7.2.
An obstetrician tells a patient with previous pre-eclampsia: "You're 7 times more likely to get pre-eclampsia again."
The problem: Pre-eclampsia incidence is ~5-8% in general population but ~15-25% in high-risk groups. At these prevalences, OR = 7.2 substantially overestimates RR. The actual relative risk might be 4-5.
The deeper problem: This was a case-control study. The OR is the ONLY measure available. You CAN'T convert it to RR without knowing the baseline incidence from an external source. The obstetrician is stuck with a number (OR) that sounds like something it's not (RR), from a study design that can't produce what the patient actually needs (absolute risk).
What the patient needs to hear: "Your risk of pre-eclampsia in this pregnancy is approximately 20-25%, compared to about 5% in the general population." This is absolute risk — far more useful than either OR or RR for clinical communication.
Paediatrics
The scenario: A case-control study of risk factors for childhood leukaemia. OR for paternal smoking = 1.4, p = 0.03.
Published: "Paternal smoking is associated with a 40% increased risk of childhood leukaemia."
The language trap: "40% increased risk" uses the word "risk" — implying RR. But 1.4 is the OR, not the RR. For a rare outcome like childhood leukaemia (incidence ~4 per 100,000), OR ≈ RR, so the approximation is valid here. But the sloppy language — calling an OR a "risk" — trains the reader to equate the two, and that habit will betray them when they encounter a common outcome.
Psychiatry
The scenario: Logistic regression from a cohort study of depression after MI. Adjusted OR = 2.8 for depression at 6 months.
Depression after MI: ~20-30% prevalence.
The problem: The paper says "MI patients have 2.8 times the odds of depression." The clinician reads: "MI patients are 2.8 times more likely to get depressed." With 25% baseline prevalence, the actual adjusted RR would be approximately 2.0-2.2.
Why this happens so often in psychiatry: Psychiatric outcomes (depression, anxiety, PTSD, substance use) are COMMON. Prevalences of 15-40% are typical. Yet the default statistical tool (logistic regression) always outputs OR. The entire psychiatric epidemiology literature is built on ORs for common outcomes — systematically overstating associations.
Community Medicine / PSM
The scenario: India's National Family Health Survey (NFHS-5) reports: "Women with no education: OR = 3.1 for childhood stunting."
Childhood stunting prevalence in India: ~35%.
The overstatement: OR = 3.1 at 35% prevalence corresponds to RR ≈ 2.0. The policy document that says "uneducated mothers' children are three times more likely to be stunted" is wrong by about 50%. Resources allocated based on OR = 3.1 may be disproportionate to the actual effect size.
The systemic problem: Large national surveys with common outcomes (stunting, anaemia, underweight — all with prevalences > 20%) routinely report ORs from logistic regression. Policy-makers interpret them as RRs. The entire quantitative basis for some public health interventions is inflated.
Orthopaedics
The scenario: A case-control study of non-contact ACL tears in athletes. OR = 4.5 for female vs male athletes.
ACL tear rate in competitive female athletes: ~3-5% per season.
The good news: At 3-5% prevalence, OR ≈ RR. So OR = 4.5 is genuinely close to "female athletes are ~4.5 times more likely." The rare disease assumption holds here.
The teaching point: This is an example where OR and RR AGREE — because the outcome is rare. The student who understands WHY they agree here (low prevalence) and WHY they wouldn't agree for a common outcome (high prevalence) has grasped the concept. The student who just memorises "OR = ad/bc" hasn't.
The 6 Ways Not Knowing This Destroys You
1. You interpret every OR as if it's an RR
You read OR = 2.5 for a common outcome and tell your patient "you're 2.5 times more likely." You've exaggerated. The RR might be 1.7. Your patient makes a fear-based decision on inflated numbers.
2. You try to calculate RR from a case-control study
Your examiner asks you to interpret a case-control study. You calculate RR from the 2×2 table. The number is meaningless because the case-to-control ratio was artificially set. You fail the viva.
3. You can't evaluate pharma claims critically
A drug company reports OR = 0.45 for stroke with their anticoagulant. If the stroke rate is 2%, that's essentially RR = 0.45 — a 55% relative risk reduction. Impressive and genuine. But if the stroke rate is 25% (secondary prevention), the same OR corresponds to RR ≈ 0.52 — roughly a 48% relative risk reduction. Still good, but the headline number overstates the benefit, and for protective ORs the distortion grows as the event rate rises.
4. You can't explain logistic regression output to a patient
Every adjusted analysis uses logistic regression. Every logistic regression outputs OR. If you can't convert OR to a clinically interpretable measure (RR, ARR, NNT), you can't translate statistical output into patient communication.
5. You misread meta-analyses
Some meta-analyses pool RRs. Others pool ORs. Some mix both without converting. If you don't notice which measure is being pooled, you can't assess whether the summary estimate is interpretable — especially when the outcome is common.
6. You miss the study design signal
Seeing an OR in a paper should make you ask: "Is this a case-control study? Or is this a cohort study that used logistic regression?" The answer changes how you interpret the number. If you don't know which measure belongs to which design, you can't critically appraise anything.
The Memory Trick — Once and For All
Risk vs Odds
Risk = What you tell patients. Events / Everyone. "Your chance of getting this is 20%."
Odds = What gamblers use. Events / Non-events. "The odds are 1 to 4" (same 20%, different framing).
Think of a bag with 20 red balls and 80 blue balls:
- Risk of red = 20/100 = 0.20
- Odds of red = 20/80 = 0.25
Risk asks: "What fraction of ALL balls are red?" Odds asks: "For every blue ball, how many red balls are there?"
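In code, the bag of balls is just two different denominators (a trivial sketch):

```python
# Same 20 red balls; the only thing that changes is what you divide by.
red, blue = 20, 80

risk = red / (red + blue)   # fraction of ALL balls that are red: 0.20
odds = red / blue           # red balls per blue ball:             0.25

print(risk, odds)   # 0.2 0.25
```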
RR vs OR
RR = You followed people forward and compared their RISKS. Direct. Intuitive. "3 times more likely."
OR = You either (a) did a case-control study and CAN'T calculate risk, or (b) used logistic regression which ALWAYS outputs odds. Indirect. Inflated for common outcomes. "3 times the odds."
When They Match
Rare disease (< 10% prevalence): OR ≈ RR. Relax.
Common outcome (> 20% prevalence): OR > RR. Substantially. Don't interpret OR as RR.
The One-Sentence Anchor
Risk is what you HAVE (a chance). Odds is what you BET (a ratio of chances). When the stakes are low (rare disease), the bet looks like the chance. When the stakes are high (common disease), the bet exaggerates the chance.
The One Thing to Remember
The odds ratio exists not because it's more useful than relative risk — it isn't. It exists because some study designs can't calculate risk, and logistic regression can't output risk. The OR is a mathematical necessity, not a clinical preference.
Every time you see an OR, ask two questions: (1) What's the outcome prevalence? (2) Is this from a case-control study or logistic regression? If the outcome is common and this is NOT a case-control study, the authors could have reported RR and chose the lazier option. If it IS a case-control study, OR is all you've got — but don't read it as RR to your patient.
The resident who sees OR = 3.5 and asks "What's the baseline prevalence?" before interpreting it — that resident understands the difference.
The resident who sees OR = 3.5 and says "3.5 times more likely" without checking prevalence — that resident will spend a career exaggerating risk to frightened patients.
The formula is easy. ad/bc. Three seconds on a calculator.
The understanding is what takes a lifetime: the odds ratio is a compromise forced by study design and software defaults, not a number designed for clinical communication. Every time you use it, you owe your patient a translation.