Why Is "Predicting the Future" Called "Going Backwards"?
The Problem First
You're a cardiology resident. A 55-year-old man walks into your OPD. You know his age, BMI (31), systolic BP (148), total cholesterol (240), HbA1c (7.2), smoking status (yes), and family history (positive).
You want to answer one question: What is this man's risk of having a heart attack in the next 10 years?
You can't wait 10 years to find out. You need an answer now. Today. Before he leaves your clinic.
So you use the Framingham Risk Score. You plug in his numbers. It spits out: 18% risk.
How did the score know? It didn't. It predicted — using data from thousands of patients who were followed for decades. It found the mathematical relationship between risk factors (age, BP, cholesterol, smoking) and the outcome (heart attack). That mathematical relationship is called regression.
Every risk calculator you've ever used — Framingham, CURB-65, APACHE II, Wells score, CHA₂DS₂-VASc — is regression wearing a clinical disguise. Strip the friendly interface, and underneath is a regression equation that someone built from data.
If you don't understand regression, you don't understand how any of these tools were created, what their limitations are, or when they'll fail you.
Why Is It Called "Regression"? — The Most Confusing Name in Statistics
This is where every student gets stuck. The word "regression" in English means going backward, returning to a previous state, deterioration. In oncology, "regression" means a tumour is shrinking. In psychology, "regression" means reverting to childish behaviour.
So why does a statistical method for prediction — which is forward-looking — have a name that means going backward?
The answer is one of the most fascinating accidents in the history of science.
TERM DECONSTRUCTION: Regression
Word Surgery
- Re- = "back, again" (Latin re-)
- -gress- = "to step, to go" (Latin gressus, from gradi = "to walk")
- -ion = noun-forming suffix
- Literal meaning: "The act of stepping back"
Think of it like "progress" (stepping forward) vs "regress" (stepping backward). Same root, opposite prefixes.
Why This Name? Francis Galton coined it in 1886. He was NOT naming a statistical method. He was naming a biological phenomenon — the fact that offspring of extreme parents tend to be less extreme, "stepping back" toward the population average. The method he used to study this phenomenon inherited the phenomenon's name. It's like naming all telescopes "moon-scopes" because the first telescope was pointed at the moon.
The "Aha" Bridge So... the method called "regression" has nothing to do with going backward. It's named after the first thing it was used to study (traits regressing toward the mean), not what it does (predicting forward). The name describes the original discovery, not the tool.
Naming Family
- Progress (pro + gress = stepping forward)
- Digress (di + gress = stepping aside)
- Transgress (trans + gress = stepping across/beyond)
- Regression to the mean — the biological phenomenon that gave regression analysis its misleading name
Francis Galton's Sweet Peas (1877)
Francis Galton — Charles Darwin's half-cousin, polymath, obsessive measurer of everything — was studying heredity. He wanted to know: if a father is tall, will the son be equally tall?
He started with sweet pea seeds (easier than humans). He sorted parent seeds by size into seven groups, planted them, and measured the offspring seeds.
What he expected: Big parents → equally big offspring. Small parents → equally small offspring. A straight 1:1 relationship.
What he found: Big parents did produce bigger-than-average offspring — but NOT as big as the parents. Small parents produced smaller-than-average offspring — but NOT as small as the parents.
The offspring regressed toward the average. Extreme parents produced less-extreme children. Tall fathers had sons who were tall but not AS tall. Short fathers had sons who were short but not AS short.
Galton called this phenomenon "regression toward mediocrity" (later renamed "regression to the mean"). The offspring were "going back" toward the population average.
From Phenomenon to Method (1886)
Galton then studied this in humans. He collected heights of 928 adult children and their parents. He plotted mid-parent height (an average of the two parents' heights, x-axis) against child height (y-axis) and drew the best-fitting straight line through the data.
The slope of that line was less than 1.0 — confirming that children's heights "regressed" toward the population mean relative to their parents.
Galton called the line itself a "regression line" — because it described the regression-toward-the-mean phenomenon. The METHOD got named after the PHENOMENON it was first used to study.
It's as if all microscopes were called "malaria-scopes" because an early microscope happened to be used to study malaria. The name describes the first application, not the method itself.
Karl Pearson and the Mathematisation (1896-1903)
Karl Pearson, Galton's protégé, took Galton's graphical method and made it rigorous. He developed the mathematical framework for fitting lines to data using least squares, calculated correlation coefficients, and formalised what we now call "linear regression."
Pearson kept Galton's name even though the method had nothing to do with "regressing" anymore. By then, regression was being used to predict crop yields, analyse economic data, and model physical phenomena — none of which involved regression toward the mean.
The name stuck. Nearly 140 years later, we're still using a word that means "going backward" for a method that predicts forward. The name is an accident of history, preserved by tradition, and confusing to every new learner.
Ronald Fisher (1920s-1930s) — The Full Framework
Fisher extended regression to multiple predictors (multiple regression), connected it to ANOVA, and developed the F-test for regression significance. He also introduced the concept of "analysis of covariance" (ANCOVA) — regression with categorical predictors alongside continuous ones.
Fisher's framework is the basis for virtually all clinical prediction models, including every risk score you use.
The Dictionary Problem — Why the Name Actively Harms Understanding
Let's look up "regression" in various dictionaries:
| Dictionary | Definition |
|---|---|
| Oxford English | "A return to a former or less developed state" |
| Merriam-Webster | "The act or an instance of regressing; reversion to an earlier mental or behavioral level" |
| Medical (Dorland's) | "1. A return to a former state. 2. Subsidence of symptoms. 3. In oncology, decrease in tumour size." |
| Statistical | "A method for modelling the relationship between a dependent variable and one or more independent variables" |
The statistical definition has NOTHING to do with the other three. A student who looks up "regression" in a regular dictionary will be more confused than before they looked. The word's everyday meaning (going backward) is the opposite of its statistical meaning (predicting forward).
Other Confusing Statistical Terms for Comparison
| Term | Everyday Meaning | Statistical Meaning | Confusion Level |
|---|---|---|---|
| Regression | Going backward | Predicting relationships | Catastrophic |
| Significant | Important, meaningful | Unlikely due to chance alone | High |
| Normal | Ordinary, healthy | Bell-shaped distribution | High |
| Power | Strength, authority | Probability of detecting a real effect | Moderate |
| Error | Mistake | Natural variability / residual | Moderate |
| Confidence | Self-assurance | Probability coverage of an interval | Moderate |
Regression wins the prize for most misleading statistical term. At least "significant" and "normal" have a distant connection to their statistical meanings. "Regression" has none.
What Regression Actually Does — No Jargon
The Core Idea
You have data. Each patient has several measurements (age, BP, cholesterol) and an outcome (heart attack yes/no, or BP value, or survival time).
You want to find the mathematical formula that best connects the measurements to the outcome.
That formula lets you:
- Predict the outcome for a new patient based on their measurements
- Identify which measurements matter most (and which are useless)
- Quantify how much each measurement contributes (e.g., "each 10 mmHg increase in SBP increases risk by 12%")
- Adjust for confounders (separate the effect of smoking from the effect of age when both are associated with heart disease)
The Analogy
Regression is like a recipe that takes ingredients (predictors) and tells you what dish (outcome) they'll produce.
- Ingredients: Age, BP, cholesterol, smoking, BMI
- Recipe (regression equation): Risk = 0.02 x Age + 0.015 x SBP + 0.008 x Cholesterol + 1.2 x Smoking - 3.5
- Dish: 10-year cardiac risk = 18%
The "recipe" was learned by studying thousands of patients where you knew both the ingredients AND the dish. The regression found the weights (0.02, 0.015, 0.008, 1.2) that best predicted the known outcomes. Now you apply those weights to a new patient.
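The recipe metaphor is literal enough to run. Here is a minimal sketch of the toy equation above in Python (the weights are invented for illustration, not the real Framingham model):

```python
# The toy "recipe" from above; weights are invented for teaching,
# NOT the real Framingham coefficients.
def risk_score(age, sbp, cholesterol, smoker):
    # Each coefficient multiplies its ingredient; -3.5 is the intercept.
    return 0.02 * age + 0.015 * sbp + 0.008 * cholesterol + 1.2 * smoker - 3.5

# The 55-year-old from the opening vignette (smoker = 1 means "yes").
print(round(risk_score(age=55, sbp=148, cholesterol=240, smoker=1), 2))  # 2.94
```

Note that with these toy weights the output is a raw linear score, not the 18% probability. Real calculators pass a score like this through a further transformation (such as the logistic function, covered below) to turn it into a probability between 0 and 1.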
Types of Regression — The Family Tree
Why So Many Types?
Different clinical questions require different types of outcomes. The type of outcome determines the type of regression:
| Your Outcome | Type | Regression Method | Who Developed It | Year |
|---|---|---|---|---|
| Continuous (BP, HbA1c, weight) | Numerical | Linear Regression | Galton/Pearson/Fisher | 1886-1930s |
| Binary (dead/alive, disease/no disease) | Yes/No | Logistic Regression | Joseph Berkson (Mayo Clinic) | 1944 |
| Time to event (survival time, time to relapse) | Censored time | Cox Proportional Hazards | David Cox | 1972 |
| Count (number of seizures, exacerbations) | Integer | Poisson Regression | Building on Poisson's distribution | 1900s |
| Ordinal (mild/moderate/severe, mRS 0-6) | Ordered categories | Ordinal Logistic Regression | Peter McCullagh | 1980 |
Linear Regression — The Ancestor
What it does: Finds the straight line (or plane, in multiple dimensions) that best predicts a continuous outcome from one or more predictors.
TERM DECONSTRUCTION: Linear
Word Surgery
- Line- = "line" (Latin linea = "linen thread, string, line")
- -ar = "relating to" (adjective suffix)
- Literal meaning: "Relating to a line"
Why This Name? Because this regression fits a straight line through data. In 2D, it's literally a line on a graph. In 3D, it's a flat plane. In higher dimensions, it's a "hyperplane." But the core idea is always: straight, not curved.
The "Aha" Bridge So... "linear" regression = the relationship between X and Y is modelled as a straight line. If you increase X by 1, Y always changes by the same fixed amount. No curves, no bends, no surprises. Straight-line thinking, mathematically.
Naming Family
- Non-linear regression — allows curves
- Curvilinear — curved line (curvi + linear)
- Collinear/Multicollinear — variables that lie along the same line (co + linear; more on this below)
The equation: Y = β₀ + β₁X₁ + β₂X₂ + ... + ε
Let's deconstruct every piece of this equation.
TERM DECONSTRUCTION: Coefficient (β₁, β₂...)
Word Surgery
- Co- = "together" (Latin co-)
- -effici- = "to make, to do" (Latin efficere = "to bring about")
- -ent = adjective suffix meaning "doing"
- Literal meaning: "Something that works together (with the variable) to produce a result"
Why This Name? In mathematics, a coefficient is a number that multiplies a variable. In 3x, the coefficient is 3 — it "works together with" x to produce the result. In regression, β₁ is the coefficient of X₁ because it multiplies X₁ to contribute to the prediction.
The "Aha" Bridge So... a regression coefficient tells you: "for every 1-unit increase in this predictor, the outcome changes by THIS much." It's the multiplier. The co-worker of the variable. β₁ = 0.015 for SBP means each 1 mmHg of blood pressure "co-operates" with the equation to add 0.015 units of risk.
Naming Family
- Correlation coefficient (r) — a standardised measure of association
- Coefficient of determination (R²) — proportion of variance explained
- Coefficient of variation (CV) — SD as percentage of mean
TERM DECONSTRUCTION: Intercept (β₀)
Word Surgery
- Inter- = "between" (Latin inter-)
- -cept = "to catch, to seize" (Latin capere = "to take")
- Literal meaning: "To catch between" — where something is caught/crossed
Why This Name? In geometry, the intercept is where a line crosses ("intercepts") an axis. The y-intercept is where the line crosses the y-axis. In regression, β₀ is the value of Y when ALL predictors are zero — the point where the regression line "intercepts" the y-axis.
The "Aha" Bridge So... the intercept is the starting point. Before any predictor kicks in. Like a restaurant bill before anyone orders food — there's already a fixed charge (cover charge = intercept). Everything you order (predictors x coefficients) adds to that base.
Naming Family
- Intercept (geometry: where a line crosses an axis)
- Contraception (contra + cept = catching against, preventing conception)
- Reception (re + cept = catching back, receiving)
TERM DECONSTRUCTION: Slope
Word Surgery
- From Middle English slope, possibly from Old English slupan = "to slip"
- Literal meaning: An incline, a slant
Why This Name? The slope of a line is how steeply it tilts. A steep slope = big change in Y for a small change in X. A flat slope = X changes a lot but Y barely moves. In regression, the coefficient IS the slope — it tells you the steepness of the relationship.
The "Aha" Bridge So... slope = coefficient in simple regression. If the slope is 2.5, the line "rises" 2.5 units for every 1 unit you "walk" along the x-axis. Flat slope = weak effect. Steep slope = strong effect. The steepness of the hill IS the strength of the relationship.
Think of it like walking in Shimla: some roads are flat (slope near 0, the predictor barely matters), some are steep (slope large, the predictor has a big effect on the outcome).
TERM DECONSTRUCTION: Residual (ε)
Word Surgery
- Re- = "back" (Latin)
- -sid- = "to sit, to settle" (Latin sedere = "to sit")
- -ual = adjective suffix
- Literal meaning: "What sits back" → what remains, what's left over
Why This Name? A residual is what's LEFT OVER after the regression has done its best. It's the difference between what the model predicted and what actually happened. Predicted BP = 140. Actual BP = 148. Residual = 8. That 8 "remains" unexplained.
The "Aha" Bridge So... residuals are the leftovers. The regression ate what it could explain; the residual is what's still on the plate. In a perfect model (which never exists in biology), all residuals would be zero. In reality, residuals represent everything you didn't measure, random biological noise, and sheer unpredictability of human biology.
Think of it like a restaurant bill. You can explain most of it (food + drinks + tax). But there's always some rounding, some miscellaneous charge you can't account for. That's the residual.
Naming Family
- Residue (chemistry: what remains after a reaction)
- Residency (originally: remaining/residing in a place)
- Error term — another name for the residual (though "error" suggests a mistake, which it isn't — it's natural variability)
Interpretation of coefficients: "For every 1 unit increase in X₁, Y changes by β₁ units, holding all other predictors constant." That last phrase — "holding all other predictors constant" — is the magic of regression. It lets you isolate the effect of one variable while controlling for others. This is why regression is the primary tool for confounding adjustment in observational studies.
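You can watch that "magic" happen in a small simulation. This sketch (all numbers invented, fitted with numpy's least-squares solver) builds a dataset where older patients smoke more, so the crude smoking coefficient is inflated by age; adding age as a covariate recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(30, 70, n)
# In this toy dataset older patients smoke more, so age confounds
# the smoking -> outcome relationship.
smoking = (rng.uniform(0, 1, n) < 0.6 * (age - 30) / 40).astype(float)
# Truth: risk rises 0.1 per year of age, plus 2.0 for smoking.
y = 0.1 * age + 2.0 * smoking + rng.normal(0, 1, n)

# Crude: regress the outcome on smoking alone.
crude = np.linalg.lstsq(np.column_stack([np.ones(n), smoking]),
                        y, rcond=None)[0][1]

# Adjusted: add age as a covariate -- "holding age constant".
adjusted = np.linalg.lstsq(np.column_stack([np.ones(n), smoking, age]),
                           y, rcond=None)[0][1]

print(round(crude, 2), round(adjusted, 2))  # crude is inflated; adjusted is near 2.0
```

The crude coefficient mixes smoking's effect with age's; the adjusted one isolates smoking.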
TERM DECONSTRUCTION: Confounder
Word Surgery
- Con- = "together, with" (Latin con-)
- -found- = "to pour, to mix" (Latin fundere = "to pour")
- -er = agent noun suffix
- Literal meaning: "Something that pours together" → that which mixes things up, causes confusion
Why This Name? A confounder literally "con-founds" (pours together, mixes up) the effect of two variables so you can't tell them apart. Age confounds the smoking-heart disease relationship because age is "mixed in" with both smoking and heart disease. You can't separate smoking's effect from age's effect without statistical adjustment.
The "Aha" Bridge So... a confounder is a mixer. It takes two separate relationships and pours them into the same cup so you can't taste them individually. Regression is the tool that un-mixes them — it separates each ingredient's contribution even when they're mixed together.
Think of it like a detective investigating a crime. Two suspects were at the scene together (confounder = they're mixed up). The detective's job (regression) is to figure out who actually did what — separating the guilty from the merely present.
Naming Family
- Confuse (same root: con + fuse = pour together → mix up)
- Confound (to mix up variables)
- Foundry (where metals are poured/melted together)
- Covariate — a variable you "co-vary" with (adjust for) in the model
TERM DECONSTRUCTION: Covariate
Word Surgery
- Co- = "together, with" (Latin)
- -vari- = "to change" (Latin variare)
- -ate = noun/verb suffix
- Literal meaning: "Something that varies together with" the main variables
Why This Name? A covariate is a variable that co-varies (changes alongside) the predictor and the outcome. In regression, you include covariates to adjust for their influence. It's a neutral term — it doesn't say whether the variable is a confounder, a predictor, or something else. It just means "another variable in the model."
The "Aha" Bridge So... "covariate" is the polite, non-judgmental word for "another variable we're including in the regression." Whether it's a confounder (bad if ignored) or a precision variable (helps reduce noise), it's called a covariate once it enters the model.
Naming Family
- Variable (vari + able = something that can change)
- Variance (how much a variable changes)
- Covariance (how two variables change together)
Assumptions of Linear Regression:
- Linearity — the relationship between each X and Y is a straight line
- Independence — observations don't influence each other
- Normality of residuals — the errors (ε) are Gaussian
- Homoscedasticity — the spread of errors is constant across all values of X
- No multicollinearity — predictors aren't too highly correlated with each other
TERM DECONSTRUCTION: Multicollinearity
Word Surgery
- Multi- = "many" (Latin multus)
- Co- = "together" (Latin)
- -line- = "line" (Latin linea)
- -arity = noun suffix meaning "the quality of"
- Literal meaning: "The quality of many things lying along the same line together"
Why This Name? When two predictors are highly correlated, they're essentially measuring the same thing — they "line up" together. Imagine plotting predictor A against predictor B: if they're perfectly correlated, all points fall on a straight line (they're co-linear). Multi-collinearity = multiple predictors all lining up with each other.
The "Aha" Bridge So... multicollinearity means your predictors are stepping on each other's toes. If systolic BP and diastolic BP are highly correlated (r = 0.85), the regression can't figure out which one is responsible for the outcome — because when one goes up, the other almost always goes up too. It's like two singers singing in perfect unison: you can hear the song, but you can't tell who's contributing what.
Why it's a problem: The regression becomes unstable. Coefficients swing wildly. Standard errors inflate. You might conclude SBP is protective and DBP is harmful when they're just reflections of each other. The fix: drop one, or combine them.
Naming Family
- Collinear (two things on the same line)
- Multivariate (many variables — different concept, often confused with multicollinearity)
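A small simulation makes the instability visible. In this sketch (invented numbers, fitted with numpy's least-squares solver), DBP is almost a copy of SBP; refitting the same model on two half-samples makes the individual coefficients swing, while their combined per-mmHg contribution barely moves:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
sbp = rng.normal(130, 15, n)
dbp = 0.6 * sbp + rng.normal(0, 1.5, n)   # DBP tracks SBP almost perfectly
y = 0.05 * sbp + rng.normal(0, 2, n)      # truth: only SBP drives the outcome

def fit(idx):
    # Ordinary least squares: intercept, SBP coefficient, DBP coefficient.
    X = np.column_stack([np.ones(len(idx)), sbp[idx], dbp[idx]])
    return np.linalg.lstsq(X, y[idx], rcond=None)[0]

b1 = fit(rng.choice(n, n // 2, replace=False))   # half-sample 1
b2 = fit(rng.choice(n, n // 2, replace=False))   # half-sample 2

# Individual coefficients swing between refits...
print(round(b1[1], 3), round(b1[2], 3))
print(round(b2[1], 3), round(b2[2], 3))
# ...but the combined contribution stays near the true 0.05 per mmHg.
print(round(b1[1] + 0.6 * b1[2], 3), round(b2[1] + 0.6 * b2[2], 3))
```

The model knows the total effect of the "SBP+DBP pair" quite precisely; it just can't split the credit between the two singers.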
TERM DECONSTRUCTION: Least Squares
Word Surgery
- Least = smallest (Old English laest)
- Squares = squared values (squaring the residuals)
- Literal meaning: "The method that makes the sum of squared residuals as small as possible"
Why This Name? Adrien-Marie Legendre (1805) and Carl Friedrich Gauss (1809) independently developed this method. They wanted to find the line that fits data best. "Best" was defined as: minimise the total distance between data points and the line. But they squared each distance before adding them up.
Why squared? Three reasons:
- Squaring makes negative distances positive (a point below the line has a negative residual — squaring it makes it positive)
- Squaring punishes big errors more than small ones (a residual of 10 contributes 100, not 10)
- Squaring makes the mathematics solvable with calculus (smooth, differentiable function with one minimum)
The "Aha" Bridge So... "least squares" = find the line that makes the squared leftovers as small as possible. Think of it like parking a car between cones: you want to minimise how much you bump each cone, and bumping the far ones counts extra (squaring penalises big misses).
The method is called "ordinary least squares" (OLS) and it remains the engine inside every linear regression you run. "Ordinary" just means basic, no special modifications — to distinguish it from weighted least squares or generalised least squares.
Naming Family
- Ordinary Least Squares (OLS) — the default method
- Weighted Least Squares (WLS) — some residuals count more than others
- Generalised Least Squares (GLS) — handles correlated residuals
TERM DECONSTRUCTION: R-squared (R²)
Word Surgery
- R = the multiple correlation coefficient (uppercase R for multiple regression, lowercase r for simple correlation)
- Squared = multiplied by itself
- Literal meaning: "The square of the correlation between predicted and actual values"
Why This Name? In simple regression with one predictor, R² literally equals Pearson's r squared. In multiple regression, R is the correlation between the model's predicted values and the actual observed values. Square it and you get the proportion of variance in the outcome explained by the model.
The "Aha" Bridge So... R² = 0.65 means the regression model explains 65% of the variation in the outcome. The other 35% is unexplained — residual noise, unmeasured variables, biology being messy.
Think of it like a weather forecast. R² = 0.65 means the forecast explains 65% of temperature variation. Good enough to pack an umbrella or sunscreen, but don't bet your life on it.
Why it matters: R² = 0.10 means your model explains only 10% of the outcome. Ninety percent is noise. Your "significant" predictor is barely scratching the surface.
Naming Family
- Adjusted R² — penalises for adding too many predictors (prevents overfitting)
- r² (lowercase) — coefficient of determination for simple correlation
- Pseudo-R² — approximations of R² for logistic regression (which doesn't have a true R²)
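The "R² equals r squared" claim is easy to verify numerically. This sketch (made-up data) computes R² as variance explained (1 − SS_res/SS_tot) and compares it with the squared Pearson correlation:

```python
# R² two ways, on made-up data: as "variance explained" (1 - SS_res/SS_tot)
# and as the squared correlation r² -- identical in simple regression.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.3, 9.7]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

slope = sxy / sxx
intercept = my - slope * mx

ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / syy          # proportion of variance explained
r = sxy / (sxx * syy) ** 0.5          # Pearson correlation

print(round(r_squared, 4), round(r ** 2, 4))  # 0.9952 0.9952
```

Both routes give the same number — which is exactly why the quantity is called R-squared.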
Logistic Regression — The Clinical Workhorse
Why it exists: Linear regression predicts a number. But many clinical questions have yes/no outcomes. Will this patient die? Will the tumour recur? Does this patient have the disease?
You can't use linear regression to predict a probability because linear regression can produce values below 0 or above 1 — which probabilities can't be.
TERM DECONSTRUCTION: Logistic
Word Surgery
- Logist- = from the logistic curve (French logistique)
- -ic = "relating to"
- The word "logistic" here does NOT come from "logistics" (supply chain management). That's a different word entirely.
Why This Name? This is one of the most misunderstood term origins in statistics. The trail goes back to Pierre-François Verhulst (1838), a Belgian mathematician studying population growth.
Verhulst noticed that populations don't grow exponentially forever — they hit a carrying capacity and plateau. The growth curve is S-shaped (sigmoid). He needed a name for this curve.
Verhulst called it the "courbe logistique" (logistic curve). Why "logistique"? Scholars still debate this. The most likely explanation: Verhulst derived it from the Greek logistikos (λογιστικός) meaning "skilled in calculation/reasoning." Some argue he was contrasting it with the "logarithmic" curve (exponential growth), implying the logistic curve is the "reasonable, calculated" version of growth that accounts for limits.
A century later, Joseph Berkson at the Mayo Clinic (1944) borrowed Verhulst's S-shaped curve to model probabilities. He needed a function that squeezes any number into the 0-1 range. The logistic function does exactly that:
Probability = 1 / (1 + e^(-z)), where z is the familiar linear combination of predictors (β₀ + β₁X₁ + β₂X₂ + ...)
Because the function came from the logistic curve, the regression that uses it became "logistic regression."
The "Aha" Bridge So... "logistic" regression is named after a mathematical curve from 1838 that was itself named for unclear reasons. The key insight: the logistic function is an S-shaped curve that converts any number into a probability between 0 and 1. That's what makes it perfect for yes/no outcomes. The name is historical baggage, not a description of what it does.
If you remember one thing: logistic = S-curve = probability machine.

Naming Family
- Logistic curve / Sigmoid curve — the S-shaped function (sigmoid from Greek sigma = S-shaped)
- Logit — the log of the odds; Berkson coined it as a "logistic unit," by analogy with probit. It's the inverse of the logistic function. "Logit regression" and "logistic regression" mean the same thing.
- Probit — PROBability unIT; an alternative to logit using the normal distribution instead of the logistic
- Logistics (supply chain) — completely unrelated despite identical spelling
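Here is the probability machine in action — a minimal sketch showing that however extreme the input z gets, the output never escapes the 0-to-1 range:

```python
import math

def logistic(z):
    # The S-shaped "probability machine": maps any real number into (0, 1).
    return 1 / (1 + math.exp(-z))

# However extreme z gets, the output stays between 0 and 1.
for z in [-10, -2, 0, 2, 10]:
    print(z, round(logistic(z), 4))
```

At z = 0 the curve sits exactly at 0.5; large negative z pushes the probability toward 0, large positive z toward 1.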
What logistic regression produces: Not the outcome directly, but the odds ratio.
TERM DECONSTRUCTION: Odds Ratio
Word Surgery
- Odds = from Old Norse oddr = "point of a triangle, the odd (uneven) one." In gambling, "odds" meant the unequal chances — the ratio of success to failure. Odds of 3:1 means 3 chances of winning for every 1 of losing.
- Ratio = from Latin ratio = "reckoning, calculation, reason"
- Literal meaning: "The ratio of one set of odds to another"
Why This Name? The odds ratio compares the odds of an outcome in two groups. Odds in the exposed group divided by odds in the unexposed group. If smoking gives you 4:1 odds of lung cancer and not-smoking gives you 1:1 odds, the odds ratio = 4/1 = 4.
In logistic regression, each coefficient (β) represents the log-odds ratio for that predictor. Exponentiate it (e^β) and you get the odds ratio.
The "Aha" Bridge So... the odds ratio is a comparison of chances. An OR of 2.5 means the odds are 2.5 times higher in the exposed group. Each coefficient in logistic regression is secretly an odds ratio in disguise — just take e^β to unmask it.
"Smoking has an OR of 2.5 for lung cancer after adjusting for age, sex, and occupational exposure" — that "after adjusting for" comes from logistic regression. The regression separated smoking's effect from the confounders.
Naming Family
- Relative Risk (RR) — ratio of probabilities (not odds); used in cohort studies and RCTs
- Hazard Ratio (HR) — ratio of instantaneous risks; used in survival analysis
- Odds vs Probability — odds = p/(1-p); probability = successes/total. They're related but NOT the same. When disease is rare, OR approximates RR. When common, they diverge.
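The odds arithmetic from the smoking example above, sketched in a few lines (toy probabilities, invented coefficient):

```python
import math

def odds(p):
    # Odds = probability of the event / probability of no event.
    return p / (1 - p)

# Toy numbers from the text: smokers have 4:1 odds, non-smokers 1:1.
or_smoking = odds(0.8) / odds(0.5)   # p = 0.8 gives odds 4; p = 0.5 gives odds 1
print(round(or_smoking, 1))          # odds ratio = 4.0

# A logistic-regression coefficient is the log of an odds ratio,
# so exponentiating it "unmasks" the OR.
beta = math.log(2.5)                 # illustrative coefficient from a model
print(round(math.exp(beta), 1))      # e^beta = 2.5
```

This is all "exponentiate the coefficient" means in a methods section: e^β converts the model's log-odds scale back to an odds ratio.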
This is the most commonly used regression in clinical research. Every case-control study, every diagnostic accuracy study, every risk factor study uses logistic regression.
Cox Proportional Hazards — The Survival Analysis Engine
Why it exists: Many clinical outcomes are time-to-event — not just "did they die?" but "when did they die?" Linear regression can't handle censoring (patients lost to follow-up). Logistic regression ignores timing.
David Cox (1972), a British statistician, developed a regression model specifically for survival data. His insight: you don't need to know the actual shape of the survival curve (the "baseline hazard"). You only need to estimate how much each predictor multiplies the hazard (risk per unit time).
TERM DECONSTRUCTION: Cox Proportional Hazards
Word Surgery — piece by piece:
Cox — simply named after Sir David Cox (1924-2022), British statistician. His 1972 paper is one of the most cited in all of statistics (80,000+ citations). Refreshingly honest naming for once — the method is named after the person, not some obscure Latin root.
Proportional:
- Pro- = "in front of, in favour of, in proportion to" (Latin)
- -portion- = "a share, a part" (Latin portio)
- -al = adjective suffix
- Literal meaning: "Maintaining the same ratio/share"
Hazard:
- From Old French hasard = "game of dice" (originally from Arabic az-zahr = "the die")
- In survival analysis, "hazard" = the instantaneous risk of the event at any given moment, given you've survived until that moment. It's a rate, not a probability.
Why This Name? The model assumes that the ratio of hazards between two groups stays constant (proportional) over time. If Drug A halves the hazard compared to placebo (HR = 0.5), it halves it at 1 month, 6 months, and 5 years equally. The hazards maintain the same proportion.
The "Aha" Bridge So... "proportional hazards" = the relative risk stays the same at every time point. The groups' danger levels may both increase over time, but the RATIO between them stays fixed. It's like two cars on a highway: both speed up, both slow down, but one is always exactly twice as fast as the other. That constant "twice as fast" is the hazard ratio.
Naming Family
- Hazard function (h(t)) — the instantaneous risk at time t
- Baseline hazard (h₀(t)) — the hazard when all predictors are zero
- Hazard ratio (HR) — the multiplicative effect of a predictor on the hazard
TERM DECONSTRUCTION: Hazard Ratio
Word Surgery
- Hazard = instantaneous risk (from the dice-game origin above)
- Ratio = comparison by division
Why This Name? It's the ratio of hazards in two groups. HR = hazard in treatment / hazard in control. An HR of 0.72 means the treatment group's instantaneous risk is 72% of the control group's — a 28% reduction.
The "Aha" Bridge So... HR = 0.72 means "at any given moment, patients on treatment have 28% less danger of the event compared to control." It's a moment-by-moment comparison of danger levels.
Key nuance: HR is NOT "28% fewer deaths." It's "28% lower rate of dying at any given instant." These are different concepts. The cumulative effect depends on how long you follow patients.
This is the primary analysis method for every oncology trial (OS, PFS), every cardiovascular outcomes trial, and every survival study. When a paper reports "HR 0.65, 95% CI 0.52-0.81, p<0.001" — that came from a Cox model.
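A quick constant-hazard sketch (rates invented) makes the nuance concrete: the hazard ratio is 0.72 at every instant by construction, yet the ratio of cumulative risks is not 0.72 and drifts with follow-up time:

```python
import math

# Assumed constant hazards (events per person-year; numbers invented).
h_control = 0.30
h_treat = 0.72 * h_control   # HR = 0.72 at every instant, by construction

def cumulative_risk(h, years):
    # With a constant hazard h, survival is exp(-h*t); risk is its complement.
    return 1 - math.exp(-h * years)

for years in [1, 5]:
    rc = cumulative_risk(h_control, years)
    rt = cumulative_risk(h_treat, years)
    # The ratio of cumulative risks is NOT 0.72, and it changes with follow-up.
    print(years, round(rt / rc, 3))
```

The moment-by-moment danger ratio is fixed; the cumulative "how many events by year t" comparison depends on how long you follow patients.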
Poisson Regression — The Count Modeller
TERM DECONSTRUCTION: Poisson Regression
Word Surgery
- Poisson = named after Siméon Denis Poisson (1781-1840), a French mathematician
- The word "Poisson" literally means "fish" in French (yes, really)
- Poisson developed the probability distribution for counting rare events in fixed intervals
Why This Name? Poisson published his distribution in 1837 to model the number of wrongful convictions in France. The distribution was later applied to deaths from horse kicks in the Prussian army (Ladislaus Bortkiewicz, 1898) — a classic example of counting rare events.
Poisson regression uses this distribution to model count outcomes: number of seizures per month, number of asthma exacerbations per year, number of hospital admissions per quarter.
The "Aha" Bridge So... Poisson regression is named after a Frenchman whose last name means "fish," who studied how often innocent people were convicted. The distribution he created is perfect for counting how many times something happens in a fixed period. Count data → Poisson regression. The fish connection is pure coincidence.
Naming Family
- Negative binomial regression — handles count data that's more spread out (overdispersed) than Poisson allows
- Zero-inflated Poisson — for count data with excess zeros
- Rate ratio — the output of Poisson regression (analogous to OR in logistic, HR in Cox)
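The log-link arithmetic behind the rate ratio, with invented rates — Poisson regression models the log of the rate, so its exponentiated coefficient is a rate ratio (just as e^β is an odds ratio in logistic regression):

```python
import math

# Invented rates: 3.0 exacerbations/year untreated, 2.1 treated.
rate_untreated = 3.0
rate_treated = 2.1

# Poisson regression models the LOG of the rate:
#   log(rate) = b0 + b1 * treated
b0 = math.log(rate_untreated)        # intercept = log of the baseline rate
b1 = math.log(rate_treated) - b0     # coefficient = log of the rate ratio

print(round(math.exp(b1), 2))        # rate ratio = 0.7 (a 30% lower rate)
```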
The Regulatory Dimension
FDA and Regression — It's Everywhere
1. Primary Analysis of Pivotal Trials
Almost every pivotal trial's primary analysis uses regression in some form:
| Endpoint Type | FDA-Expected Primary Analysis | Regression Under the Hood |
|---|---|---|
| Continuous (HbA1c change, BP) | ANCOVA or MMRM | Linear regression with baseline as covariate |
| Binary (response rate, cure rate) | Logistic regression or CMH test | Logistic regression for adjusted analysis |
| Time-to-event (OS, PFS) | Cox proportional hazards | Cox regression |
| Count (exacerbation rate) | Negative binomial regression | Poisson/NB regression |
| Repeated measures (longitudinal) | MMRM | Mixed-effects linear regression |
TERM DECONSTRUCTION: ANCOVA
Word Surgery
- AN = ANalysis
- COVA = of COVAriance
- Full form: Analysis of Covariance
Why This Name? ANOVA (Analysis of Variance) compares group means. ANCOVA adds a covariate — a continuous variable you adjust for. It's ANOVA + regression combined. The name emphasises that it analyses the covariance (co-variation) between the outcome and the covariate.
The "Aha" Bridge So... ANCOVA IS regression. When a paper says "primary analysis was ANCOVA with treatment and baseline as covariates" — that's a linear regression with treatment group as a categorical predictor and baseline value as a continuous predictor. ANCOVA is not a separate method. It's regression with a specific covariate structure. The name just sounds fancier.
Think of it like "biryani" vs "spiced rice with meat." Same dish, different name. ANCOVA is regression wearing a traditional outfit.
Naming Family
- ANOVA (Analysis of Variance — regression without covariates, just groups)
- MANOVA (Multivariate ANOVA — multiple outcomes)
- MANCOVA (Multivariate ANCOVA)
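To make the "ANCOVA IS regression" point concrete, here is a hand-rolled sketch: ordinary least squares on an intercept, a treatment dummy, and the baseline value. The data are made up and noise-free, so the fitted coefficients recover the generating values exactly.

```python
def ols(X, y):
    """Least squares via normal equations + Gaussian elimination (small models only)."""
    p = len(X[0])
    A = [[sum(row[r] * row[c] for row in X) for c in range(p)] for r in range(p)]
    b = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(p)]
    for col in range(p):                      # forward elimination with pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            A[r] = [arc - f * acc for arc, acc in zip(A[r], A[col])]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):            # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    return beta

# Made-up HbA1c data, generated exactly from y = 1.0 - 0.5*treat + 0.8*baseline
baseline = [7.0, 8.0, 9.0, 7.5, 8.5, 9.5]
treat    = [0,   0,   0,   1,   1,   1]
y = [1.0 - 0.5 * t + 0.8 * b for t, b in zip(treat, baseline)]

X = [[1.0, t, b] for t, b in zip(treat, baseline)]  # intercept, treatment, baseline
beta = ols(X, y)
print(beta)  # recovers [1.0, -0.5, 0.8]; the "ANCOVA treatment effect" is beta[1]
```

Swap in real trial data and this is exactly the model a paper means by "ANCOVA with treatment and baseline as covariates" — in practice you would use a statistics package, which also supplies standard errors, confidence intervals, and p-values.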
TERM DECONSTRUCTION: MMRM
Word Surgery
- M = Mixed
- M = Model for
- R = Repeated
- M = Measures
- Full form: Mixed Model for Repeated Measures
Why This Name?
- Mixed = the model has both "fixed effects" (treatment, visit — things you're studying) and "random effects" (patient-level variation — things that differ randomly between people)
- Repeated Measures = the same patient is measured multiple times (week 4, week 8, week 12)
The "Aha" Bridge So... MMRM IS regression. It's a linear regression that accounts for the fact that measurements from the same patient are correlated (my week-4 BP is related to my week-8 BP because I'm the same person). The "mixed" part handles this within-patient correlation.
MMRM is the standard analysis for every psychiatric drug trial (HAM-D, PANSS, MADRS change scores) and most longitudinal trials with continuous outcomes.
If you understand regression, you understand 80% of how pivotal trial data is analysed.
2. Confounding Adjustment in Observational Studies
FDA increasingly accepts real-world evidence (RWE) from observational data. But observational data has confounding — treatment groups differ in baseline characteristics.
Regression is the primary tool for adjusting these confounders. When an FDA reviewer asks "did you adjust for age, sex, comorbidities, and disease severity?" — they're asking "did you include these as covariates in your regression model?"
ICH E9 R1 (Estimands): The estimand framework requires specifying how intercurrent events are handled. Different intercurrent event strategies (treatment policy, hypothetical, while-on-treatment) require different regression model specifications. The estimand determines the regression.
3. Subgroup Analysis and Treatment-by-Covariate Interactions
When FDA asks "is the treatment effect consistent across subgroups?" — they're asking about interaction terms in regression.
Outcome = β₀ + β₁(Treatment) + β₂(Age) + β₃(Treatment × Age) + ε
If β₃ is significant, the treatment effect varies by age — the drug works differently in young vs old patients. This interaction term in the regression is how the FDA evaluates subgroup consistency.
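Plugging hypothetical coefficients into that equation makes the point obvious: under an interaction model there is no single "treatment effect", only an effect at a given age. All numbers here are invented for illustration.

```python
# Hypothetical coefficients, invented for illustration only
b1 = -8.0    # β₁: treatment effect at age = 0 (the model's reference point)
b3 = 0.125   # β₃: treatment-by-age interaction

def treatment_effect(age):
    """Effect of treatment for a patient of a given age: β₁ + β₃ · age."""
    return b1 + b3 * age

for age in (40, 60, 80):
    print(f"age {age}: effect = {treatment_effect(age):+.1f} mmHg")
# age 40: -3.0 mmHg; age 60: -0.5 mmHg; age 80: +2.0 mmHg — in this toy
# model the drug's benefit shrinks with age and reverses in the oldest patients
```

This is why a significant β₃ forces the question "effective for whom?" rather than "effective or not?"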
4. Dose-Response Modelling
FDA's dose-finding guidance (ICH E4) expects regression modelling of the dose-response relationship. Does efficacy increase linearly with dose? Plateau? Follow an Emax curve? These are regression models with different functional forms.
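As a sketch of one of those functional forms, here is the Emax model with invented parameters: efficacy rises steeply at low doses, then plateaus once the dose swamps the ED50.

```python
def emax_response(dose, e0=0.0, e_max=20.0, ed50=50.0):
    """Emax model: E(dose) = E0 + Emax * dose / (ED50 + dose). Parameters invented."""
    return e0 + e_max * dose / (ed50 + dose)

for dose in (0, 50, 200, 1000):
    print(f"dose {dose:4d} mg -> effect {emax_response(dose):5.2f}")
# At dose = ED50 (50 mg) the effect is exactly half of Emax (10.00); by
# 1000 mg the curve has plateaued near Emax = 20 — further dose escalation
# adds almost nothing except, typically, toxicity
```

In a real dose-finding analysis these parameters are estimated by non-linear regression from the trial data, and competing shapes (linear, log-linear, sigmoid Emax) are compared for fit.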
5. Propensity Score Methods
TERM DECONSTRUCTION: Propensity Score
Word Surgery
- Pro- = "forward, toward" (Latin)
- -pens- = "to hang, to weigh, to lean toward" (Latin pendere = "to hang/weigh")
- -ity = noun suffix meaning "the quality of"
- Score = a numerical value
- Literal meaning: "A score reflecting the tendency (leaning) toward something"
Why This Name? Coined by Paul Rosenbaum and Donald Rubin (1983). The propensity score is each patient's propensity (tendency, inclination) to receive the treatment, based on their observed characteristics. A patient who is older, sicker, and has more comorbidities has a higher propensity to receive aggressive treatment.
The "Aha" Bridge So... the propensity score asks: "How likely was this patient to get the treatment, based on their profile?" It's the predicted probability from a logistic regression of treatment assignment on baseline covariates. Then you match or adjust patients with similar propensity scores, creating a quasi-randomised comparison.
Regression builds the propensity score. Regression adjusts for confounding using the propensity score. Regression everywhere.
Naming Family
- Propensity score matching (PSM) — matching treated and control patients with similar scores
- Inverse probability of treatment weighting (IPTW) — weighting treated patients by 1/propensity score and controls by 1/(1 − propensity score)
- Propensity = tendency, inclination (same root as "pensive" = thoughtful, weighing things)
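A toy sketch of the matching step, with invented propensity scores — in practice these would be the predicted probabilities from a logistic regression of treatment assignment on baseline covariates:

```python
# Hypothetical propensity scores (invented); in practice, predicted
# probabilities from a logistic regression of treatment on covariates
treated  = {"T1": 0.62, "T2": 0.35, "T3": 0.81}
controls = {"C1": 0.60, "C2": 0.30, "C3": 0.78, "C4": 0.55}

# Greedy 1:1 nearest-neighbour matching without replacement
matches, available = {}, dict(controls)
for pid, ps in sorted(treated.items()):
    best = min(available, key=lambda cid: abs(available[cid] - ps))
    matches[pid] = best
    del available[best]
print(matches)  # {'T1': 'C1', 'T2': 'C2', 'T3': 'C3'}

# The IPTW alternative: weight treated by 1/PS, controls by 1/(1 - PS)
iptw = {pid: 1 / ps for pid, ps in treated.items()}
iptw.update({cid: 1 / (1 - ps) for cid, ps in controls.items()})
```

Real propensity analyses add refinements (calipers, matching with replacement, weight truncation), but the logic is this: pair or reweight patients so the treated and control groups look alike on everything that went into the score.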
Branch-by-Branch — Where Regression Matters
General Medicine
Where you use it without knowing: Every risk calculator.
- Framingham Risk Score = logistic regression on Framingham cohort data
- CURB-65 = simplified logistic regression for pneumonia mortality
- CHA₂DS₂-VASc = simplified logistic regression for stroke risk in AF
- Wells Score = simplified logistic regression for PE probability
The catch you don't know about: These scores were developed on specific populations. The Framingham score was built on a mostly white, middle-class Massachusetts cohort. The regression coefficients (weights) reflect THAT population's risk factor distributions. Apply it to a South Asian population with different baseline rates of diabetes, different lipid profiles, different genetic risk — and the regression equation miscalibrates. It may under- or over-estimate risk systematically.
This is why QRISK was developed for the UK, and why region-specific risk scores exist. The regression is only as good as the population it was trained on. This is a regression concept (external validity / calibration) that most clinicians never learn.
Surgery
Where it matters: Predicting post-operative complications.
The ASA score is a crude predictor. Modern surgical risk prediction uses regression models with dozens of variables: age, BMI, ASA class, surgery type, duration, emergency vs elective, comorbidities, functional status, lab values.
P-POSSUM (Physiological and Operative Severity Score for the Enumeration of Mortality and Morbidity) uses logistic regression to predict morbidity and mortality from 12 physiological and 6 operative variables.
What you miss if you don't understand regression: P-POSSUM overpredicts mortality in low-risk groups and underpredicts in high-risk groups. This is a known regression problem called miscalibration. The model's predicted probabilities don't match actual observed rates. A surgeon who knows this interprets P-POSSUM critically. One who doesn't takes the number as truth.
Paediatrics
Where it matters: Growth chart construction.
Growth charts (WHO, CDC, IAP) are built using quantile regression — a variant that estimates how the percentiles (not just the mean) of weight/height change with age. The smooth curves on the growth chart are regression lines fitted to thousands of children's longitudinal data.
The LMS method (Lambda-Mu-Sigma) used for growth charts is a regression technique that models how the distribution of a measurement changes with age — not just the mean, but the skewness and spread.
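Reading a child off such a chart uses the fitted L, M, S values at that child's age. A sketch with invented parameters — the formula itself is Cole's LMS z-score transformation:

```python
def lms_z(x, L, M, S):
    """Cole's LMS z-score: z = ((x/M)**L - 1) / (L*S), valid for L != 0."""
    return ((x / M) ** L - 1) / (L * S)

# Invented L, M, S for weight-for-age at some age — NOT real chart values
L_, M_, S_ = -0.35, 12.9, 0.11

z_median = lms_z(12.9, L_, M_, S_)   # a child exactly at the median M scores z = 0
z_heavy  = lms_z(14.5, L_, M_, S_)   # heavier than the median -> positive z
print(f"z at median: {z_median:.2f}, z for heavier child: {z_heavy:.2f}")
```

L (the Box-Cox power) is what handles skewness: weight distributions in children are right-skewed, so the same absolute deviation above and below the median maps to different z-scores.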
Obstetrics
Where it matters: Pre-eclampsia prediction models.
First-trimester screening for pre-eclampsia (the FMF model by Nicolaides) uses logistic regression combining:
- Maternal factors (age, BMI, parity, history)
- Mean arterial pressure
- Uterine artery pulsatility index
- Serum biomarkers (PAPP-A, PlGF)
The regression equation combines these into a single predicted probability of developing pre-eclampsia before 37 weeks.
The regression assumption most often violated: The FMF model assumes the predictors have the SAME regression coefficients in all populations. But uterine artery PI distributions differ between ethnicities. PlGF levels differ. The model, developed in European populations, may miscalibrate in South Asian or African populations. This is why local validation studies are essential before adopting any prediction model.
Psychiatry
Where it matters: The primary analysis of almost every psychiatric drug trial.
MMRM — the standard analysis for HAM-D, PANSS, MADRS change scores — is a regression model. It includes:
- Treatment group (the variable you care about)
- Visit (time)
- Treatment x Visit interaction (does the drug effect change over time?)
- Baseline score (covariate)
- Stratification factors
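Written out in the same style as the interaction equation earlier, the model behind that list looks like this (a schematic, not any specific trial's exact specification):

ChangeScore = β₀ + β₁(Treatment) + β₂(Visit) + β₃(Treatment × Visit) + β₄(Baseline) + patient-level random effect + ε

The patient-level term (or, equivalently, an unstructured covariance for each patient's repeated residuals) is what makes the model "mixed" — every other piece is ordinary linear regression.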
What you miss if you don't understand regression: When a paper says "MMRM analysis showed significant treatment-by-visit interaction, p=0.02" — that means the drug separated from placebo at some visits but not others. The treatment effect isn't constant over time. This interaction term is the most informative result in the regression, but most clinicians skip over it because they don't know what it means.
Community Medicine / PSM
Where it matters: Identifying risk factors for disease.
Every epidemiological study that reports "adjusted odds ratios" used logistic regression. Every study that reports "adjusted hazard ratios" used Cox regression.
"After adjusting for age, sex, SES, smoking, and BMI, the OR for diabetes associated with physical inactivity was 2.1 (95% CI 1.4-3.2)."
That entire sentence is the output of ONE logistic regression. The "adjusting for" phrase = including those variables as covariates in the model. The OR = exponentiated regression coefficient. The CI = based on the standard error of the coefficient.
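You can reproduce that arithmetic directly. The coefficient and standard error below are hypothetical, chosen to match the numbers in the quoted sentence:

```python
import math

# Hypothetical logistic-regression output for "physical inactivity" (invented)
beta, se = 0.742, 0.215   # coefficient on the log-odds scale, and its SE

or_adj  = math.exp(beta)                 # exponentiated coefficient -> adjusted OR
ci_low  = math.exp(beta - 1.96 * se)     # Wald 95% CI, computed on the
ci_high = math.exp(beta + 1.96 * se)     # log-odds scale, then exponentiated

print(f"OR {or_adj:.1f} (95% CI {ci_low:.1f}-{ci_high:.1f})")  # OR 2.1 (95% CI 1.4-3.2)
```

Note the asymmetry of the interval around 2.1 — it is symmetric on the log scale, not the OR scale, which is why published ORs always have lopsided confidence intervals.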
The trap: "Adjusting for" in regression only works for confounders that are measured and included. Unmeasured confounders (genetic factors, unmeasured exposures, unmeasured behaviours) cannot be adjusted for. A regression with 10 covariates does NOT mean all confounding has been removed. It means confounding by those 10 variables has been addressed. Everything unmeasured remains.
This is why observational studies, no matter how sophisticated their regression, cannot prove causation as cleanly as an RCT (which addresses ALL confounders, measured and unmeasured, through randomisation).
Orthopaedics
Where it matters: Implant survival analysis.
Joint registry data is analysed using Cox regression to identify predictors of implant failure:
"After adjusting for age, sex, BMI, implant type, fixation method, and surgeon volume, cemented TKR had a significantly lower revision rate than uncemented (HR 0.72, 95% CI 0.61-0.85)."
That HR came from a Cox regression. Every variable in the "adjusted for" list was a covariate. And per the earlier nuance, HR 0.72 means a 28% lower instantaneous rate of revision at any given time — not 28% fewer revisions overall.
The proportional hazards assumption is frequently violated in implant survival. Early failures (infection, instability) have different risk factors than late failures (wear, loosening). The hazard ratio may not be constant over time. If the Cox assumption is violated, the HR is a weighted average that may not represent reality at any specific time point. A Schoenfeld residual test should be reported — it almost never is.
The 6 Ways Not Knowing Regression Destroys You
1. You use risk scores without understanding their limitations
Every risk score is a regression equation. If you don't know that regression equations are built on specific populations, you'll blindly apply Framingham to a 40-year-old Indian woman when the equation was built on 50-year-old American men. The score will give you a number. The number will be wrong. You'll make clinical decisions on a miscalibrated prediction.
2. You can't read the primary analysis of any clinical trial
ANCOVA, MMRM, logistic regression, Cox regression — these are the four primary analysis methods used in >90% of pivotal trials. If you don't understand regression, you can read the abstract ("p=0.02, drug works") but you cannot evaluate HOW they got there, whether the model was appropriate, whether the assumptions held, or whether the conclusion is robust.
3. You don't understand "adjusted" vs "unadjusted" results
A paper reports:
- Unadjusted OR for smoking and heart disease: 4.5
- Adjusted OR (for age, sex, BP, diabetes): 2.8
Why did the OR shrink? Because some of smoking's apparent effect was actually confounding by age (older people smoke more AND have more heart disease). The regression separated smoking's independent effect from age's effect.
If you don't understand regression, you don't understand what "adjustment" means, what it can and cannot do, and why the adjusted and unadjusted results differ.
4. You misinterpret coefficients
A regression coefficient of β = 0.15 for "years of education" predicting income does NOT mean that one more year of education causes a 0.15-unit increase in income. It means that in the observed data, after adjusting for the other variables in the model, each additional year of education is associated with a 0.15-unit higher income.
Regression coefficients are associations, not causal effects (unless the study design supports causal inference — RCT or valid instrumental variable).
TERM DECONSTRUCTION: Overfitting
Word Surgery
- Over- = "too much, excessively" (Old English)
- -fitting = "making fit, adjusting to" (from "fit" = to be the right shape)
- Literal meaning: "Fitting too much" — making the model adjust excessively to the training data
Why This Name? Imagine buying a suit. A well-fitted suit follows your body shape. An OVERFITTED suit follows every wrinkle, every crease, every temporary bulge from the samosa you just ate. It looks perfect on you today — but put it on someone else (a new patient) and it's a disaster.
The "Aha" Bridge So... overfitting = the model memorised the noise in the training data instead of learning the true signal. It performs brilliantly on the data it was built on and terribly on new data. The model is too smart for its own good.
A regression with 20 predictors on 50 patients will overfit. There aren't enough patients to reliably estimate 20 relationships. The model will find patterns that are pure chance — like concluding that patients whose names start with 'S' have better outcomes.
Rule of thumb: You need at least 10-15 observations per predictor in the model. 5 predictors → need at least 50-75 patients.
Naming Family
- Underfitting — model too simple, misses real patterns
- Bias-variance tradeoff — underfitting (high bias) vs overfitting (high variance)
- Regularisation (Lasso, Ridge) — techniques that prevent overfitting by shrinking coefficients
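The memorisation failure is easy to demonstrate. A sketch with made-up data: a degree-5 polynomial through 6 points ("20 predictors on 50 patients" in miniature) fits every training observation exactly, then produces nonsense on a new one, while the "too simple" linear trend does fine.

```python
def interpolate(xs, ys, x):
    """Lagrange polynomial through every training point — maximal overfitting."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Made-up training data: roughly y = 2x plus noise
xs = [0, 1, 2, 3, 4, 5]
ys = [0.3, 1.8, 4.4, 5.7, 8.5, 9.6]

# Perfect on the training data...
print(f"{interpolate(xs, ys, 3):.1f}")   # 5.7 — zero training error
# ...absurd on a new observation at x = 6 (the true trend says ~12)
print(f"{interpolate(xs, ys, 6):.1f}")   # -11.4 — the model memorised the noise
```

The flexible model has enough parameters to chase every accident of the sample; none of those accidents repeat in new data, which is exactly why out-of-sample validation, not training fit, is the test that matters.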
5. You can't evaluate whether confounding was adequately addressed
When a paper says "we adjusted for age, sex, and BMI" — is that enough? What about smoking? SES? Medication use? Comorbidities?
The choice of covariates in a regression model is a scientific judgment call. Including too few = residual confounding. Including too many irrelevant ones = overfitting (the model memorises noise in the data rather than learning the true relationship). Adjusting for intermediates on the causal pathway = over-adjustment that hides part of the true effect; adjusting for common effects of exposure and outcome = collider bias.
If you don't understand regression, you can't evaluate these choices. You accept whatever the authors included and trust their adjusted results, which may be badly confounded or badly overfitted.
6. You can't design your own thesis analysis
Your thesis has a continuous outcome, three predictors, and two confounders. What analysis do you run?
If you understand regression: multiple linear regression with confounders as covariates. Check assumptions (linearity, normality of residuals, homoscedasticity, multicollinearity). Report coefficients, CIs, p-values, R².
If you don't: you ask your statistician to "run something" and get back a table you can't interpret, with coefficients you don't understand, from a model whose assumptions you didn't check.
The One Thing to Remember
Regression is a mathematical recipe finder. You give it ingredients (predictors) and known dishes (outcomes from past patients), and it figures out the recipe (the equation connecting ingredients to dish). Then you use that recipe to predict the dish for a new patient whose ingredients you know.
Every risk score is regression. Every "adjusted" analysis is regression. Every clinical trial's primary analysis is regression. Every time a paper says "after controlling for," that's regression.
The name is misleading — it has nothing to do with going backward. Blame Galton and his sweet peas. The method is forward-looking: it takes what you know about a patient and predicts what you don't.
The resident who understands regression reads papers differently. When they see "adjusted OR 2.1 (95% CI 1.4-3.2)" they know: this came from a logistic regression, the 2.1 is an exponentiated coefficient, the adjustment was for specific covariates that may or may not be sufficient, and the conclusion depends on the model's assumptions being met.
The resident who doesn't understand regression sees a magic number and either believes it or ignores it. Neither response is medicine. Both are faith.