Stateazy · 40 min read · 12 April 2026

Two Things Move Together. So What?

Correlation — the most abused concept in medicine. Pearson, Spearman, partial, and why "correlation does not imply causation" is only half the story.


The Problem First

You're an internal medicine resident. A pharmaceutical rep shows you a graph. On the x-axis: hours of yoga per week. On the y-axis: HbA1c levels. The dots slope downward beautifully. r = -0.72. p < 0.001.

The rep's pitch: "Yoga significantly reduces HbA1c. Correlation of -0.72. Highly significant."

You're almost convinced. Then something itches. You ask: "Was this an RCT?"

"No, a cross-sectional survey of 400 adults."

Now think. Who does yoga regularly? Health-conscious people. Who are health-conscious people? People who also eat better, exercise more, take their medications on time, have higher income, have better access to healthcare, are less likely to smoke.

The correlation between yoga and HbA1c might be real — but the CAUSE of lower HbA1c could be any of those confounders. The yoga might contribute nothing. The correlation might be entirely explained by the fact that yoga-doers are the same people who do everything else right.

r = -0.72 told you two things move together. It told you absolutely nothing about whether one caused the other. And the pharma rep either didn't know the difference or hoped you didn't.

"Correlation does not imply causation" is the most quoted and least understood sentence in medical statistics.


What Does "Correlation" Actually Mean?

Before we look at any formula, let's deconstruct the word itself.


TERM DECONSTRUCTION: Correlation

Word Surgery

  • Cor- = "together, with" (Latin cor-, variant of con-)
  • -relat- = "to carry back, to relate" (Latin relatio, from referre = "to carry back")
  • -ion = noun-forming suffix
  • Literal meaning: "A relating together" — two things connected, carried together

Why This Name? Francis Galton coined it in 1888 as "co-relation" (hyphenated). He was studying heredity — do tall fathers have tall sons? He needed a word for "the degree to which two measurements relate together." He picked "co-relation" because the "co" (together) emphasised that it's about TWO variables relating simultaneously.

Karl Pearson dropped the hyphen in 1896: "co-relation" became "correlation."

The "Aha" Bridge So... correlation literally means "relating together." That's all it measures: do these two things move in sync? The word says nothing about WHY they relate — just that they do. The name is honest: it measures co-relation, not co-causation.

Naming Family

  • Co-relation (Galton's original hyphenated form)
  • Auto-correlation (a variable correlating with itself over time; auto = self)
  • Cross-correlation (correlation between two different time series)
  • Partial correlation (correlation after removing the effect of a third variable)

Dictionary definitions:

  • Oxford English: "A mutual relationship or connection between two or more things"
  • Merriam-Webster: "The state or relation of being correlated; specifically: a relation existing between phenomena or things that tend to vary, be associated, or occur together"
  • Latin root: cor- (together) + relatio (relation) = "a relating together"
  • Statistical: "A standardised measure of the linear association between two variables, ranging from -1 to +1"

The everyday meaning is broad — any connection between things. The statistical meaning is narrow — specifically a linear association, specifically standardised to a -1 to +1 scale.

Why the Name Is Confusing

The word "correlation" in everyday language implies a meaningful connection. "There's a correlation between poverty and crime" suggests a deep, possibly causal relationship.

In statistics, correlation is just a number measuring co-movement. It has no opinion about meaning, mechanism, or causation. You can calculate a correlation between the number of pirates and global temperature (they're negatively correlated — as pirates decreased over centuries, temperature rose). The number is mathematically correct. The implication that pirates prevent global warming is absurd.

The confusion: everyday "correlation" implies significance and meaning. Statistical "correlation" is a measurement tool that implies neither. When a paper says "we found a significant correlation," a clinician's brain hears "meaningful connection" when it should hear "two numbers moved together in our sample, and the co-movement was unlikely to be due to chance alone."


The History — A Story of Cousins, Stars, and Sweet Peas

Francis Galton (1888) — The Inventor

The same Galton who gave us regression invented correlation — and for the same reason: studying heredity.

After discovering regression to the mean, Galton needed a number to express how strongly two traits were related. Were father's height and son's height weakly related or strongly related? Was arm length related to leg length? Was head size related to intelligence? (Galton was a committed eugenicist — his science was better than his ethics.)

In 1888, Galton published a paper introducing the concept of an "index of co-relation" — a single number between -1 and +1 that captured how tightly two variables moved together. He called it "co-relation" (hyphenated) because it measured the degree to which two measurements related together (co = together, relation = connection).

Galton's original method was graphical. He plotted two variables, drew an ellipse around the data cloud, and used the shape of the ellipse to estimate the strength of association. A thin, tilted ellipse = strong correlation. A fat, circular blob = weak correlation.

Karl Pearson (1896) — The Formula

Galton gave us the concept. Karl Pearson gave us the mathematics.

In 1896, Pearson published the mathematical formula for what he called the "product-moment correlation coefficient" — the formal name for what we now call Pearson's r or just "the correlation coefficient."


TERM DECONSTRUCTION: Pearson's r

Word Surgery

  • Pearson's = named after Karl Pearson (1857-1936), English mathematician and biostatistician. He was Galton's protege, founded the world's first statistics department (University College London), and created many of the tools we still use (chi-square test, standard deviation as a term, correlation coefficient formula).
  • r = the lowercase letter chosen by convention. Why r? Galton had used "r" for "regression" (as in the regression coefficient). When Pearson standardised the correlation, he kept "r" — likely because correlation and regression were born from the same research programme.

Naming Family

  • r = Pearson's product-moment correlation (parametric, for continuous normal data)
  • R (uppercase) = multiple correlation coefficient (regression with multiple predictors)
  • rho (ρ) = Spearman's rank correlation (non-parametric)
  • tau (τ) = Kendall's rank correlation (non-parametric)

TERM DECONSTRUCTION: Product-Moment

Word Surgery

  • Product = the result of multiplication (Latin productum = "something brought forth")
  • Moment = a statistical summary measure (borrowed from physics: "moment of inertia")
  • In physics, a "moment" measures how force is distributed around a point
  • In statistics, "moments" summarise distributions: 1st moment = mean, 2nd moment = variance, etc.
  • Literal meaning: "A measure based on the products of deviations (moments) from the mean"

Why This Name? The formula multiplies (takes the "product" of) each X's deviation from its mean by each Y's deviation from its mean. These products are then averaged and standardised. The "moment" part comes from the fact that deviations from the mean are the raw material of statistical moments (variance, skewness, etc.).

The "Aha" Bridge So... "product-moment correlation coefficient" literally means: "a standardised measure based on multiplying together how far X and Y each deviate from their averages." When both deviate in the same direction (both above average, or both below), the product is positive. When they deviate in opposite directions, the product is negative. Sum these products, standardise, and you get r.

Think of it like this: Imagine a class of students. For each student, note whether they're above or below average in both height and weight. If tall students tend to be heavy (both deviations same sign), products are positive, r is positive. If tall students tend to be light (deviations opposite), products are negative, r is negative. If there's no pattern, products cancel out, r is near zero.


The Formula

r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / [(n − 1) × s_x × s_y]

In words:

  1. For each patient, calculate how far X is from its mean and how far Y is from its mean
  2. Multiply these two deviations together (the "product")
  3. Sum all these products
  4. Divide by (n − 1) × SD of X × SD of Y (the standardisation)

When both variables tend to be above their means at the same time → products are mostly positive → r is positive → positive correlation.

When one variable tends to be above its mean when the other is below → products are mostly negative → r is negative → negative correlation.

When there's no pattern → positive and negative products cancel out → r is near zero → no linear correlation.
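The four steps above can be sketched in a few lines of Python. This is a minimal illustration with made-up height/weight numbers; the helper name pearson_r is ours, not a library function (real analyses would typically call scipy.stats.pearsonr):

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson's product-moment r, computed exactly as in the four steps."""
    x_bar, y_bar = mean(x), mean(y)
    # Steps 1-3: multiply paired deviations from the means, then sum
    sum_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Step 4: standardise by (n - 1) times the two sample SDs
    return sum_products / ((len(x) - 1) * stdev(x) * stdev(y))

# Made-up data: taller students are heavier, perfectly linearly
heights = [150, 160, 170, 180, 190]
weights = [50, 60, 70, 80, 90]
print(round(pearson_r(heights, weights), 3))  # 1.0 -- perfect positive correlation
```

Reverse the weights list and the same function returns -1.0: every product of deviations flips sign.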


Charles Spearman (1904) — The Rank Alternative

Charles Spearman, a British psychologist studying intelligence, faced a problem: his data was ordinal (rankings of students by different teachers), not continuous. Pearson's formula assumes continuous, Gaussian data.

Spearman developed a rank-based correlation: convert both variables to ranks (1st, 2nd, 3rd...), then apply Pearson's formula to the ranks.
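In code, the recipe really is that literal: rank both variables, then feed the ranks to Pearson's formula. A sketch with invented dose-effect numbers (helper names are ours; production code would use scipy.stats.spearmanr):

```python
from statistics import mean, stdev

def pearson_r(x, y):
    x_bar, y_bar = mean(x), mean(y)
    s = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return s / ((len(x) - 1) * stdev(x) * stdev(y))

def to_ranks(values):
    """Rank from 1 (smallest) upward; ties share the average of their ranks."""
    ordered = sorted(values)
    return [sum(i + 1 for i, v in enumerate(ordered) if v == val) / ordered.count(val)
            for val in values]

def spearman_rho(x, y):
    # Spearman's rho is literally Pearson's r applied to the ranks
    return pearson_r(to_ranks(x), to_ranks(y))

# A monotonic but non-linear relationship: effect = dose cubed
dose, effect = [1, 2, 3, 4, 5], [1, 8, 27, 64, 125]
print(round(pearson_r(dose, effect), 2))     # below 1: Pearson penalises the curve
print(round(spearman_rho(dose, effect), 2))  # 1.0: the rank order matches perfectly
```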


TERM DECONSTRUCTION: Spearman's rho (ρ)

Word Surgery

  • Spearman's = named after Charles Edward Spearman (1863-1945), British psychologist. He was an army officer who became interested in psychology, studied under Wilhelm Wundt in Leipzig, and is famous for two things: the rank correlation coefficient AND the theory of general intelligence ("g factor").
  • rho (ρ) = the Greek letter chosen to distinguish from Pearson's r. Just as Pearson used the Latin letter r, Spearman used the Greek equivalent. The Greek ρ (rho) was likely chosen because it's the Greek version of "r" (both represent the same sound).

Why This Name? Spearman needed a symbol different from Pearson's r to signal: "This is NOT the same measure. This one uses ranks." The Greek ρ served as that flag.

The "Aha" Bridge So... when you see ρ (rho), think "ranks." Spearman's ρ is Pearson's r applied to ranked data. It measures monotonic relationships (one variable consistently goes up as the other goes up — even if the relationship isn't a straight line). Pearson's r only catches straight-line (linear) relationships.

Naming Family

  • r = Pearson (parametric, linear, continuous data)
  • ρ = Spearman (non-parametric, monotonic, ranked data)
  • τ = Kendall (non-parametric, concordance-based, ranked data)

TERM DECONSTRUCTION: Monotonic

Word Surgery

  • Mono- = "one, single" (Greek monos)
  • -tonic = "tone, tension, direction" (Greek tonos = "stretching, tension")
  • Literal meaning: "One tone" → going in one direction

Why This Name? In music, "monotonic" means a single pitch — no ups and downs. In mathematics, a monotonic relationship goes in one direction only: either always increasing or always decreasing. It can curve, it can accelerate, it can slow down — but it never reverses direction.

The "Aha" Bridge So... Spearman's ρ captures monotonic relationships: as X goes up, Y consistently goes up (or consistently goes down). The relationship doesn't have to be a straight line — it just can't reverse. A curved-but-always-increasing relationship is monotonic. A U-shaped relationship is NOT monotonic (it reverses).

Example: Drug dose vs effect might be monotonic (more dose → more effect, even if the curve flattens). But drug dose vs side effects might be non-monotonic (low dose = few side effects, medium dose = many, high dose = patient stops taking it = fewer reported side effects).

Naming Family

  • Monotone (constant direction)
  • Non-monotonic (reverses direction — U-shaped, inverted-U)
  • Linear (special case of monotonic: straight line, constant rate)

When to use which:

  • Both variables continuous AND normally distributed → Pearson's r
  • One or both variables ordinal (ranks, scores, scales) → Spearman's ρ
  • Data is skewed or has outliers → Spearman's ρ
  • You want to detect non-linear monotonic relationships → Spearman's ρ
  • You want to detect strictly LINEAR relationships → Pearson's r

Maurice Kendall (1938) — The Third Option


TERM DECONSTRUCTION: Kendall's tau (τ)

Word Surgery

  • Kendall's = named after Sir Maurice Kendall (1907-1983), British statistician who made fundamental contributions to rank statistics and time series analysis
  • tau (τ) = the Greek letter chosen to distinguish from both r (Pearson) and ρ (Spearman). Tau is the 19th letter of the Greek alphabet.

Why This Name? Kendall needed yet another symbol. With r taken by Pearson and ρ by Spearman, he used τ. The choice of successive Greek/Latin letters for successive correlation methods is a convention, not a deep reason.

How it differs: While Spearman ranks the data and then applies Pearson's formula, Kendall counts concordant and discordant pairs. For every pair of observations, ask: "Do X and Y move in the same direction?" If yes, it's concordant. If no, it's discordant. τ = (concordant - discordant) / total pairs.

The "Aha" Bridge So... Kendall's τ asks a simpler question than Spearman: "For any two patients, if Patient A has a higher X than Patient B, does Patient A also have a higher Y?" If this is consistently true, τ is high. It's more intuitive, more robust with small samples, and has better statistical properties for hypothesis testing — but is less commonly used in medical research because Spearman is "good enough" and more familiar.

Naming Family

  • r (Pearson) — parametric, linear
  • ρ (Spearman) — non-parametric, rank-based, monotonic
  • τ (Kendall) — non-parametric, concordance-based, monotonic
  • Three tools for the same broad question: "How strongly are these two variables related?" Each captures slightly different aspects. The existence of three measures is not redundancy — it's the field acknowledging that "how strongly are these related?" is a harder question than it appears.
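Kendall's pair-counting definition is easy to make concrete. Below is a sketch of tau-a, the simplest variant, which ignores tie corrections (real analyses would use scipy.stats.kendalltau); the data is invented:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        direction = (xi - xj) * (yi - yj)
        if direction > 0:
            concordant += 1   # the pair is ordered the same way on X and Y
        elif direction < 0:
            discordant += 1   # the pair is ordered oppositely
        # direction == 0 is a tie; tau-a simply leaves tied pairs uncounted
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs

# Five patients ranked on two scales; only one pair disagrees
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 5]
print(kendall_tau(x, y))  # 0.8  (9 concordant pairs, 1 discordant, 10 in total)
```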

What the Number Actually Means

The Scale

  Value of r      Strength               What it looks like
  +1.0            Perfect positive       Every dot falls on a line going up-right. Rare in biology.
  +0.7 to +0.9    Strong positive        Clear upward trend, dots cluster around the line
  +0.4 to +0.7    Moderate positive      Visible upward trend, but substantial scatter
  +0.1 to +0.4    Weak positive          Slight trend, lots of scatter, easy to miss visually
  0               No linear correlation  Random scatter, no pattern
  -0.1 to -0.4    Weak negative          Slight downward trend
  -0.4 to -0.7    Moderate negative      Visible downward trend
  -0.7 to -0.9    Strong negative        Clear downward trend
  -1.0            Perfect negative       Every dot falls on a line going down-right

r² — The Number That Actually Tells You Something Useful


TERM DECONSTRUCTION: Coefficient of Determination (r²)

Word Surgery

  • Coefficient = "a number that works together with" (co + efficient; see regression article for full deconstruction)
  • of Determination = "that determines" (Latin determinare = "to set bounds, to fix limits")
  • = Pearson's r multiplied by itself
  • Literal meaning: "The number that determines how much of the variation is fixed/bounded by the relationship"

Why This Name? Because r² determines (sets the bounds of) how much of Y's variation is explained by X. It "fixes" the proportion: this much is explained, the rest is not. It was named "coefficient of determination" because it determines the explanatory power of the relationship.

The "Aha" Bridge So... r² tells you: what percentage of the variation in Y is accounted for by X? Everything else — the other (1-r²) — is noise, confounders, unmeasured variables, or biology being unpredictable.

This is the most important number in correlation analysis and the one most commonly ignored.


  r      r²      Interpretation
  0.9    0.81    81% of Y's variation is explained by X
  0.7    0.49    49% of Y's variation is explained by X
  0.5    0.25    25% of Y's variation is explained by X
  0.3    0.09    9% of Y's variation is explained by X

The key insight: A correlation of 0.3 sounds "moderate" to most clinicians. But r² = 0.09 means that X explains only 9% of the variation in Y. The other 91% comes from things you haven't measured. That "moderate correlation" is almost useless for prediction.

This is the most commonly missed interpretation in medical research. Authors report r = 0.45 as if it's a strong finding. r² = 0.20. Eighty percent of the variation is unexplained. The "correlated" variables share only 20% of their variation.

Think of it like a cricket match. The correlation between the opener's score and the team's total might be r = 0.5. That sounds decent. But r² = 0.25 means the opener's score explains only 25% of the variation in the team's total. The other 75% is everyone else batting, extras, and luck. A prediction of the team's total built on the opener's score alone leaves three-quarters of the variation unaccounted for.


The Five Traps — Where Correlation Misleads

Trap 1: Correlation is NOT Causation (The Classic)

You know this phrase. But do you know the THREE reasons why correlation fails to prove causation?

A. Confounding (Third Variable Problem)


TERM DECONSTRUCTION: Confounding

Word Surgery

  • Con- = "together" (Latin)
  • -found- = "to pour, to mix" (Latin fundere = "to pour")
  • Literal meaning: "Pouring together" — mixing up two effects so you can't tell them apart

Why This Name? A confounder "pours together" (con-founds) the effects of two variables. Ice cream sales and drowning deaths are both poured into the same pot by summer heat. You can't taste (distinguish) ice cream's effect from heat's effect unless you separate them.

The "Aha" Bridge So... confounding = mixing. The confounder is the mixer. It's related to both the exposure (ice cream) and the outcome (drowning) and creates a spurious connection between them. The correlation between ice cream and drowning is real in the data — but the CAUSE is heat, not ice cream.

Think of it like a wedding where two families sit together. If you photograph the crowd, you'll see "correlation" between the bride's family and groom's family being in the same place. But the wedding (confounder) brought them together. Without the wedding, they have no connection.


Ice cream sales and drowning deaths are positively correlated. Ice cream doesn't cause drowning. Both are caused by a third variable: hot weather.

In medicine: Statin use and lower mortality are correlated. But statin users are also more health-conscious, more likely to exercise, more likely to follow other medical advice (the "healthy user bias"). The correlation between statins and survival partially reflects confounding by healthy behaviour, not just the statin's pharmacological effect.
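Partial correlation (from the naming family earlier) is the standard numerical way to unmix a known confounder. Here is a toy sketch with entirely invented numbers echoing the opening yoga/HbA1c story: a single "health-consciousness" score z drives both variables, so the raw correlation looks impressive while the partial correlation collapses. Helper names are ours (real analyses would use a library routine such as pingouin's partial_corr):

```python
from math import sqrt
from statistics import mean, stdev

def pearson_r(x, y):
    xb, yb = mean(x), mean(y)
    s = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return s / ((len(x) - 1) * stdev(x) * stdev(y))

def partial_r(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz**2) * (1 - ryz**2))

# Invented data: health-consciousness z drives BOTH yoga hours and lower HbA1c
z     = [1, 2, 3, 4, 5, 6, 7, 8]
yoga  = [2 * v + e for v, e in zip(z, [1, -1, 1, -1, 1, -1, 1, -1])]
hba1c = [10 - 0.3 * v + 0.1 * e for v, e in zip(z, [1, 1, -1, -1, 1, 1, -1, -1])]

print(round(pearson_r(yoga, hba1c), 2))    # strongly negative: "yoga lowers HbA1c!"
print(round(partial_r(yoga, hba1c, z), 2)) # near zero once z is held constant
```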

B. Reverse Causation

Depression and social isolation are correlated. Does depression cause isolation? Or does isolation cause depression? The correlation doesn't tell you the direction.

In medicine: Low vitamin D and multiple sclerosis are correlated. Does low vitamin D cause MS? Or does MS (which reduces outdoor activity and sun exposure) cause low vitamin D? The correlation supports both directions equally.

C. Coincidence (Spurious Correlation)


TERM DECONSTRUCTION: Spurious Correlation

Word Surgery

  • Spurious = from Latin spurius = "false, illegitimate, bastard" (originally: born outside marriage)
  • Correlation = relating together (as above)
  • Literal meaning: "A false/illegitimate relating together" — a correlation that looks real but is meaningless

Why This Name? The word "spurious" was borrowed from its original meaning of illegitimate birth. A spurious correlation is an "illegitimate" one — it exists in the data but has no real parent (no real cause connecting the variables). It's a statistical orphan pretending to have parents.

The "Aha" Bridge So... a spurious correlation is one where the numbers say "related" but reality says "nonsense." The number of films Nicolas Cage appeared in per year correlates with the number of people who drowned in swimming pools (r = 0.666). This is a real correlation in real data. It is also complete garbage.

Naming Family

  • Spurious (false, illegitimate)
  • Confounded (mixed up with a third variable — not necessarily false, but misleading)
  • Causal (a real, directional, mechanistic relationship)

With enough variables, some will be correlated by pure chance. The website "Spurious Correlations" (Tyler Vigen) catalogues hundreds of these: per capita cheese consumption correlates with deaths from bedsheet entanglement (r = 0.95). US spending on science correlates with suicides by hanging (r = 0.99).

These are REAL correlations in REAL data. The mathematics is correct. The interpretation is absurd. The number itself cannot tell you whether the relationship is meaningful.


Trap 2: r = 0 Does NOT Mean "No Relationship"

Pearson's r measures linear relationships only. If the relationship is curved, r can be zero even with a perfect mathematical relationship.

Example: Plot drug dose (X) against therapeutic effect (Y). At low doses, effect increases with dose (positive slope). At high doses, toxicity kicks in and effect decreases (negative slope). The relationship is an inverted U-shape — perfectly predictable, clinically critical.

Pearson's r for this data? Approximately zero. The positive slope at low doses and negative slope at high doses cancel out.

A researcher who reports "no significant correlation between dose and effect (r = 0.02)" has missed the most important pharmacological relationship in their data because they used the wrong tool.

Always plot your data first. If the relationship is curved but still monotonic, Spearman's ρ will capture it. If the relationship reverses direction, like the inverted U above, no single correlation coefficient will detect it; model the curve with regression instead.
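A quick numerical check of the trap, using an invented, perfectly deterministic dose-effect curve that rises and then falls:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    xb, yb = mean(x), mean(y)
    s = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return s / ((len(x) - 1) * stdev(x) * stdev(y))

# Deterministic inverted U: effect = dose * (8 - dose)
dose   = [1, 2, 3, 4, 5, 6, 7]
effect = [d * (8 - d) for d in dose]   # 7, 12, 15, 16, 15, 12, 7

# The positive products on the rising limb exactly cancel the negative
# products on the falling limb, so r comes out as 0
print(pearson_r(dose, effect))  # 0.0 -- a perfect relationship, invisible to r
```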


Trap 3: Outliers Can Create or Destroy Correlations

One extreme data point can radically alter r.

Example: You're studying the relationship between BMI and fasting glucose in 30 patients. 29 patients show no relationship (r = 0.05). But one morbidly obese patient (BMI 55) with uncontrolled diabetes (glucose 350 mg/dL) is in the dataset. Include that one point and r jumps to 0.65.

Is the correlation "real"? Technically yes. Is it clinically meaningful? No — it's one patient driving the entire association. Remove that patient and the correlation vanishes.

This is why Spearman's ρ is more robust — ranks compress extreme values. A glucose of 350 becomes "rank 30" (highest), not "350." The rank is less influential than the raw value.
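The outlier effect is easy to reproduce with invented numbers: five patients constructed to have exactly zero correlation, plus one extreme point.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    xb, yb = mean(x), mean(y)
    s = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return s / ((len(x) - 1) * stdev(x) * stdev(y))

# Five patients constructed so BMI and glucose have exactly zero correlation
bmi     = [22, 24, 26, 28, 30]
glucose = [90, 100, 95, 110, 85]
print(pearson_r(bmi, glucose))  # 0.0

# Add one morbidly obese patient with uncontrolled diabetes (BMI 55, glucose 350)
print(round(pearson_r(bmi + [55], glucose + [350]), 2))  # jumps to roughly 0.97
```

One data point out of six now carries almost the entire "association".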


Trap 4: Correlation Between Means is Not Correlation Between Individuals


TERM DECONSTRUCTION: Ecological Fallacy

Word Surgery

  • Ecological = from Greek oikos = "house, habitat" + logos = "study." In statistics, "ecological" means "group-level" or "aggregate-level" (studying the habitat, not the individual organism)
  • Fallacy = from Latin fallacia = "deception, trick" (from fallere = "to deceive")
  • Literal meaning: "The deception of drawing individual conclusions from group-level data"

Why This Name? The term was coined by sociologist W.S. Robinson in 1950. He used "ecological" in the sociological sense: data about ecological units (groups, regions, populations) rather than individuals. The "fallacy" is assuming that what's true for the group is true for the individual.

The "Aha" Bridge So... the ecological fallacy = "the group told me so" error. Countries with higher chocolate consumption have more Nobel Prize winners (r = 0.79, NEJM, 2012). Does chocolate make you smart? No. Within each country, the chocolate-eaters are probably not the Nobel laureates. The correlation exists at the country level but may vanish (or reverse!) at the individual level.

Think of it like saying "this hospital has a high mortality rate, so any patient admitted here will die." The hospital-level statistic doesn't predict the individual patient's fate.

Naming Family

  • Ecological study (epidemiology: a study using group-level data, not individual data)
  • Atomistic fallacy (the reverse error: assuming individual-level findings apply to groups)
  • Simpson's paradox (trends that appear in groups reverse when groups are combined — related but distinct)

Countries with higher chocolate consumption have more Nobel Prize winners (r = 0.79, published in the New England Journal of Medicine, 2012). Does chocolate make you smart?

This is the ecological fallacy — correlating aggregate data (country-level means) and assuming it applies to individuals. Within each country, the people eating chocolate are probably not the Nobel laureates.

In medicine: Districts with higher average income have lower average infant mortality. This doesn't mean that if you give one family more money, their baby is less likely to die. The correlation operates at the ecological (group) level and may not exist at the individual level.

Robinson (1950) demonstrated this formally. Individual-level correlations and ecological (group-level) correlations can even have opposite signs. Rich districts may have low average mortality, but within those districts, the richest individuals might have higher mortality than the middle-class (due to stress, sedentary lifestyle, etc.).
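Robinson's sign reversal is easy to reproduce with two invented districts: within each district the relationship is perfectly negative, yet pooling across districts produces a strong positive r.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    xb, yb = mean(x), mean(y)
    s = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return s / ((len(x) - 1) * stdev(x) * stdev(y))

# Two invented districts; within EACH, y falls as x rises (r = -1 exactly)
x_a, y_a = [1, 2, 3], [5, 4, 3]
x_b, y_b = [6, 7, 8], [10, 9, 8]

print(round(pearson_r(x_a, y_a), 2))              # -1.0 within district A
print(round(pearson_r(x_a + x_b, y_a + y_b), 2))  # +0.81 when districts are pooled
```

The pooled correlation reflects the gap between the districts, not anything happening inside them.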


Trap 5: Restricted Range Suppresses Correlation


TERM DECONSTRUCTION: Restriction of Range

Word Surgery

  • Restriction = from Latin restrictio = "a drawing back, a limiting" (re + stringere = "to draw tight, bind")
  • Range = the span from lowest to highest value
  • Literal meaning: "Tightening/limiting the span of values"

Why This Name? When your sample doesn't cover the full range of a variable (because of selection, filtering, or study design), the correlation is "restricted" — squeezed, compressed, weakened. The range of values has been drawn tight, and the correlation is a casualty.

The "Aha" Bridge So... restriction of range = looking at a narrow slice and missing the big picture. It's like judging the correlation between "hours studied" and "exam score" but only studying the top 5 students. Among the top 5, study hours vary little (they all study a lot), so the correlation looks weak. Include the whole class (from 0 hours to 20 hours) and the correlation is obvious.

The classic medical example: You measure the correlation between entrance exam scores and final exam performance in medical students. r = 0.15. Weak. You conclude: "Entrance exams don't predict medical school performance." But your "sample" already passed a highly competitive entrance exam. You've cut off the bottom 95%. In the full applicant pool, the correlation would be much higher.

Naming Family

  • Restriction of range (sampling only part of the spectrum)
  • Ceiling effect (all scores cluster near the maximum — a form of range restriction)
  • Floor effect (all scores cluster near the minimum)
  • Selection bias (the broader concept — your sample isn't representative)

You measure the correlation between entrance exam scores and final exam performance in medical students. r = 0.15. Weak. You conclude: "Entrance exams don't predict medical school performance."

But wait. Your "sample" of medical students already passed a highly competitive entrance exam. You've cut off the bottom 95% of the applicant pool. You're only looking at the top 5% — a narrow slice.

In the full applicant pool (including those who scored 20% on the entrance exam), the correlation with medical school performance would be much higher. But you can't see it because the low-scorers were never admitted.

This is "restriction of range." Correlation is suppressed when the sample doesn't span the full range of the variable. The correlation between height and basketball ability looks weak if you only study NBA players (all tall). Study the general population and it's obvious.
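The suppression is mechanical, as a sketch with invented exam data shows: the same relationship measured over the full applicant pool versus only the admitted top slice.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    xb, yb = mean(x), mean(y)
    s = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return s / ((len(x) - 1) * stdev(x) * stdev(y))

# Invented applicant pool: performance tracks entrance score plus fixed +/-4 noise
scores = list(range(30))
performance = [s - 4 if i % 2 == 0 else s + 4 for i, s in enumerate(scores)]

r_full = pearson_r(scores, performance)            # the whole applicant pool
r_top  = pearson_r(scores[25:], performance[25:])  # only the admitted top 5

print(round(r_full, 2), round(r_top, 2))  # roughly 0.91 full-range vs 0.34 in the slice
```

Same noise, same relationship; only the spread of X changed.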


The Regulatory Dimension

FDA and Correlation — Where It Matters

1. Surrogate Endpoint Validation

This is the single most important regulatory application of correlation.


TERM DECONSTRUCTION: Surrogate Endpoint

Word Surgery

  • Surrogate = from Latin surrogatus = "substituted, put in place of another" (sub + rogare = "to ask/propose under")
  • Endpoint = the outcome being measured in a trial
  • Literal meaning: "A substitute outcome put in place of the real one"

Why This Name? A surrogate endpoint stands in for (substitutes for) the clinical endpoint you actually care about. Instead of waiting 5 years for overall survival data, you measure tumour shrinkage at 6 months. The tumour shrinkage is the "surrogate" — the substitute.

The word "surrogate" has the same root as "surrogate mother" — someone who stands in for another. A surrogate endpoint stands in for the real endpoint.

The "Aha" Bridge So... a surrogate is a body double. In movies, the stunt double takes the hits instead of the star. In clinical trials, the surrogate endpoint takes the measurement instead of the real outcome. But just as a stunt double doesn't always move exactly like the star, a surrogate doesn't always track the real outcome. The correlation between surrogate and real endpoint determines whether the substitution is valid.

Naming Family

  • Surrogate endpoint (the substitute measure)
  • Clinical endpoint (the real outcome: death, hospitalisation, quality of life)
  • Biomarker (a measurable biological indicator — surrogates are a special type of biomarker)
  • Validated surrogate (proven to correlate strongly enough with clinical outcomes across multiple trials)

A surrogate endpoint is a biomarker used in place of a clinical endpoint. Instead of waiting 5 years for overall survival data, you measure tumour shrinkage at 6 months (a surrogate).

But a surrogate is only valid if it correlates strongly with the clinical endpoint it replaces.

FDA's standard: "Reasonably likely to predict clinical benefit" (for accelerated approval) or "established surrogate" (for traditional approval based on surrogate).

How is this established? Correlation analysis across multiple trials:

  • HbA1c and diabetic complications: r approximately 0.7-0.8 across trials → HbA1c accepted as surrogate for diabetes trials
  • Blood pressure and cardiovascular events: well-established → BP accepted as surrogate for antihypertensive trials
  • Tumour response rate (ORR) and overall survival: r varies wildly by cancer type → ORR is sometimes accepted, sometimes not
  • Viral load (HIV) and AIDS progression: strong correlation → HIV RNA accepted as surrogate

When correlation between surrogate and clinical outcome is weak, drugs get approved that don't actually help patients. Bevacizumab in breast cancer improved PFS (the surrogate) but not OS (the clinical endpoint). The correlation between PFS and OS was weak in that cancer type. FDA granted accelerated approval, then revoked it.

The $2 billion question in drug development is often: "Is the correlation between our surrogate and the clinical endpoint strong enough for FDA to accept it?"

2. Biomarker Qualification

FDA's Biomarker Qualification Program requires evidence that a biomarker is correlated with clinical outcomes. The correlation evidence must come from multiple independent studies, not a single trial.

3. Assay Validation — Correlation Between Methods

When validating a new laboratory assay against a gold standard, you correlate the two measurements:

  • New troponin assay vs established assay: r = 0.98 → looks like near-perfect agreement
  • New POC glucose meter vs lab glucose: r = 0.95 → looks acceptable for screening

But correlation alone is insufficient for method comparison. This is a common error. r = 0.95 tells you the methods move together. It does NOT tell you they agree. One method could systematically read 20 mg/dL higher than the other and still have r = 0.99.


TERM DECONSTRUCTION: Bland-Altman

Word Surgery

  • Bland = J. Martin Bland, British statistician (University of York)
  • Altman = Douglas G. Altman (1948-2018), British statistician, founding director of the Centre for Statistics in Medicine at Oxford
  • Not a Latin deconstruction — this is an eponymous method (named after its creators)

Why This Name? Bland and Altman published their landmark paper in The Lancet in 1986, titled "Statistical methods for assessing agreement between two methods of clinical measurement." They showed that the entire medical literature was misusing correlation for method comparison. Their paper has been cited over 60,000 times.

What they proved: Correlation measures association (do they move together?). Agreement measures interchangeability (can you swap one for the other?). High correlation with systematic bias = associated but NOT agreeing. Two thermometers might correlate perfectly (r = 0.99) while one consistently reads 2 degrees higher. They're associated but you can't swap them.

The "Aha" Bridge So... the Bland-Altman plot shows: (1) the average difference between two methods (systematic bias), and (2) how much the differences scatter (limits of agreement). If the average difference is near zero AND the scatter is small, the methods agree. If the average difference is large (even with r = 0.99), there's systematic bias.

Think of it this way: Two clocks in a hospital. One runs 10 minutes fast. They correlate perfectly — when one says 3:00, the other always says 3:10. r = 1.0. But they don't AGREE. A nurse using clock A will give medications at different times than a nurse using clock B. Correlation missed this. Bland-Altman catches it.

Naming Family

  • Bland-Altman plot (the difference vs average plot)
  • Limits of agreement (mean difference ± 1.96 SD — the range within which 95% of differences fall)
  • Method comparison study (the study design that requires Bland-Altman, not correlation)

FDA expects Bland-Altman analysis for method comparison studies, not just correlation.
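A minimal Bland-Altman calculation looks like this (sketch in Python with NumPy; the paired measurements are hypothetical). The quantities computed are the ones named above: the mean difference (bias) and the limits of agreement:

```python
import numpy as np

# Hypothetical paired measurements from two methods on the same 10 samples.
a = np.array([102., 115., 98., 130., 145., 160., 108., 125., 140., 95.])
b = np.array([118., 132., 110., 148., 160., 178., 122., 145., 155., 112.])

diff = b - a                  # per-sample differences
bias = diff.mean()            # systematic bias (mean difference)
sd = diff.std(ddof=1)         # SD of the differences
loa_low = bias - 1.96 * sd    # lower limit of agreement
loa_high = bias + 1.96 * sd   # upper limit of agreement

r = np.corrcoef(a, b)[0, 1]
print(f"r = {r:.3f}, bias = {bias:.1f}, LoA = [{loa_low:.1f}, {loa_high:.1f}]")
# r is near-perfect, yet every b reads well above a: associated, not agreeing.
```

The actual Bland-Altman plot charts each difference (`b - a`) against each pair's average (`(a + b) / 2`), with horizontal lines at the bias and the two limits of agreement.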

4. Dose-Response Correlation

ICH E4 (Dose-Response Information) expects evidence of a dose-response relationship. Correlation between dose levels and response magnitudes is part of this evidence, though regression modelling (fitting dose-response curves) is the primary analytical tool.

5. Inter-Rater and Intra-Rater Reliability

For subjective endpoints (radiology reads, pathology scoring, clinical assessments), FDA expects evidence of reproducibility.


TERM DECONSTRUCTION: ICC (Intraclass Correlation Coefficient)

Word Surgery

  • Intra- = "within" (Latin intra-)
  • Class = "a group, category" (Latin classis)
  • Correlation = relating together (as deconstructed above)
  • Coefficient = a number that quantifies (as deconstructed above)
  • Literal meaning: "A number measuring correlation WITHIN a group/class"

Why This Name? Pearson's r measures correlation BETWEEN two different variables (X and Y). ICC measures correlation WITHIN a single class of measurements — for example, multiple raters all measuring the same thing. The "intra" (within) distinguishes it from "inter" (between) class correlation.

Why it exists: If three radiologists each score the same 50 CT scans for tumour grade, you don't have two variables (X and Y). You have three measurements of the SAME variable. Pearson's r can only compare two things. ICC can compare any number of raters measuring the same thing.

The "Aha" Bridge So... ICC = "how much do measurements within the same group agree?" If three raters give similar scores to the same patient, ICC is high. If they wildly disagree, ICC is low. Unlike Pearson's r, ICC captures both association AND agreement — it drops when one rater systematically scores higher than another.

Key distinction:

  • Pearson's r can be high even if raters systematically disagree (one always scores 10 points higher)
  • ICC drops in that situation because systematic disagreement = poor agreement within the class

FDA reviewers will flag a submission that reports inter-rater Pearson r = 0.90 and claims "excellent agreement." ICC is the appropriate measure.

Naming Family

  • ICC(1) — each subject rated by different raters (one-way random)
  • ICC(2,1) — each subject rated by the same raters, single measure (two-way random)
  • ICC(3,1) — each subject rated by the same raters, treating raters as fixed
  • Kappa (Cohen's κ) — agreement for categorical data (as ICC is for continuous data)
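The key distinction above can be verified numerically. In this sketch (Python with NumPy; the rater scores are invented), rater B scores a constant 20 points higher than rater A on a 0-100 scale. Pearson's r is a perfect 1.00, while ICC(2,1), computed here from the two-way ANOVA mean squares, collapses:

```python
import numpy as np

# Hypothetical scores (0-100) from two raters on the same 10 patients.
# Rater B systematically scores 20 points higher than rater A.
rater_a = np.arange(50, 100, 5, dtype=float)   # 50, 55, ..., 95
rater_b = rater_a + 20.0

r = np.corrcoef(rater_a, rater_b)[0, 1]        # Pearson ignores the offset

# ICC(2,1) from two-way ANOVA mean squares (subjects x raters).
ratings = np.column_stack([rater_a, rater_b])
n, k = ratings.shape
grand = ratings.mean()
ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects
ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
ss_err = ((ratings - grand) ** 2).sum() - ms_rows * (n - 1) - ms_cols * (k - 1)
ms_err = ss_err / ((n - 1) * (k - 1))

icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err
                            + k * (ms_cols - ms_err) / n)
print(f"Pearson r = {r:.2f}, ICC(2,1) = {icc:.2f}")  # r = 1.00, ICC ~ 0.53
```

The systematic 20-point disagreement costs ICC roughly half its value while leaving Pearson's r untouched, which is exactly why FDA reviewers flag Pearson r as a measure of inter-rater "agreement".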

Branch-by-Branch — Where Correlation Bites You

General Medicine

The scenario: A paper reports "significant positive correlation between serum uric acid and cardiovascular mortality (r = 0.35, p < 0.001)."

Junior resident's conclusion: "High uric acid causes heart attacks. Should we treat asymptomatic hyperuricemia?"

What r = 0.35 actually tells you:

  • r² = 0.12. Uric acid explains 12% of the variation in cardiovascular mortality. 88% is explained by other things.
  • The correlation may be entirely confounded. Uric acid is elevated in metabolic syndrome, renal disease, and hypertension — all of which independently cause cardiovascular death. After adjusting for these confounders (which requires regression, not correlation), the association may vanish.
  • Multiple RCTs of urate-lowering therapy have failed to reduce cardiovascular events. If uric acid were causally related, lowering it should help. It doesn't. The correlation was confounded, not causal.

You can't make treatment decisions from correlation coefficients. You need regression (to adjust for confounders) and ideally RCTs (to prove causation).


Surgery

The scenario: "Surgeon experience (number of cases performed) correlates with lower complication rates (r = -0.62)."

This is probably real and probably partially causal (practice makes better). But:

  • More experienced surgeons also work at higher-volume centres (better resources, better teams)
  • More experienced surgeons get referred easier cases (selection bias)
  • r = -0.62 → r² = 0.38. Experience explains only 38% of complication rate variation. 62% is other factors.

The policy trap: If a hospital uses this correlation to set a minimum case volume threshold for credentialing, they're assuming the relationship is causal and linear. But the correlation doesn't prove either. A surgeon with 200 cases at a poorly equipped hospital may have worse outcomes than a surgeon with 50 cases at a state-of-the-art centre. The correlation between volume and outcomes confounds the surgeon with the system.


Paediatrics

The scenario: "Screen time is negatively correlated with developmental scores in children under 5 (r = -0.28)."

Media headline: "Screens Damage Your Child's Brain."

Reality check:

  • r² = 0.08. Screen time explains 8% of developmental variation. 92% is everything else.
  • Children with more screen time may have parents who work longer hours (SES confounder), less interactive play time, more processed food (dietary confounder), less outdoor time (activity confounder).
  • Reverse causation: children with developmental delays may be given more screen time because they're harder to engage in other activities.
  • The correlation might be entirely explained by parenting style, SES, or the child's baseline developmental trajectory.

A paediatrician who tells parents "screens are causing your child's delay" based on r = -0.28 from an observational study is making a causal claim from correlational data. The appropriate statement: "Screen time is associated with lower developmental scores, but we don't know if reducing screen time will improve scores because the relationship may be confounded."


Obstetrics

The scenario: "Maternal BMI is positively correlated with birth weight (r = 0.40)."

This correlation is partially causal (higher maternal glucose in obese mothers → fetal macrosomia) and partially confounded (higher BMI mothers may have gestational diabetes, which independently causes macrosomia; higher BMI mothers may have higher caloric intake, which independently affects fetal growth).

The clinical trap: Using this correlation to predict individual birth weights. r = 0.40 → r² = 0.16. Maternal BMI explains only 16% of birth weight variation. Predicting birth weight from BMI alone would be wrong for 84% of the variation. Yet residents sometimes mentally anchor: "obese mother = big baby." That's a cognitive shortcut built on a correlation that explains one-sixth of reality.


Psychiatry

The scenario: "Serum BDNF levels correlate with depression severity (HAM-D score), r = -0.55."

Pharma interpretation: "BDNF is a biomarker for depression. We can use it to monitor treatment response."

Problems:

  • r² = 0.30. BDNF explains 30% of depression severity variation. Not terrible, but insufficient for individual-level monitoring.
  • The correlation was measured cross-sectionally. Correlation at one timepoint doesn't mean that changes in BDNF predict changes in depression over time (that requires longitudinal correlation, which is usually weaker).
  • BDNF is affected by exercise, sleep, stress, medications, and time of blood draw. The "correlation with depression" may be partially confounded by these factors.
  • Individual BDNF values have enormous overlap between depressed and non-depressed groups. A correlation of -0.55 at the group level translates to almost complete overlap at the individual level.

This is the "screening test fallacy" of biomarkers. A statistically significant correlation does NOT mean the biomarker is useful for individual diagnosis or monitoring. You need sensitivity/specificity analysis, ROC curves, and predictive values — not just correlation.
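The group-versus-individual gap can be made concrete with a simulation (Python with NumPy; the standardized "HAM-D" and "BDNF" values are purely synthetic, generated with the stated correlation of -0.55):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
rho = -0.55

# Simulate standardized HAM-D and BDNF with correlation -0.55 (illustrative).
ham_d = rng.standard_normal(n)
bdnf = rho * ham_d + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

# Median-split into "more severe" vs "less severe" depression.
severe = bdnf[ham_d > np.median(ham_d)]
mild = bdnf[ham_d <= np.median(ham_d)]

# With perfect group separation, almost no severe patient would have BDNF
# above the mild group's median. Here a sizeable fraction does.
overlap = (severe > np.median(mild)).mean()
print(f"severe patients above mild-group median BDNF: {overlap:.0%}")
```

A correlation that looks respectable in a scatter plot of 10,000 points still leaves the two groups' BDNF distributions heavily overlapping, which is why r alone cannot justify individual-level diagnosis or monitoring.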


Community Medicine / PSM

The scenario: "Per capita alcohol consumption correlates with liver disease mortality across Indian states (r = 0.73)."

The ecological fallacy in action. This correlation is between STATE-LEVEL averages. It does NOT mean that within each state, the people drinking are the ones dying of liver disease. States with high average consumption may also have worse hepatitis B/C prevalence, worse healthcare access, or different genetic susceptibility to alcohol.

The policy trap: A state government uses this ecological correlation to justify a blanket alcohol ban. But the correlation at the individual level might be lower (r = 0.30), and the state-level correlation might be driven entirely by 2-3 extreme states (restriction of range in reverse — a few extreme data points inflating the correlation).

Individual-level data → individual-level correlation → individual-level policy. Ecological data → ecological correlation → at best ecological-level hypothesis generation.
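The gap between ecological and individual correlation is easy to reproduce in a simulation (Python with NumPy; the five "states" and all values are synthetic). Here mortality is driven entirely by a state-level confounder that tracks average alcohol intake, so within a state an individual's drinking does not change their risk at all:

```python
import numpy as np

rng = np.random.default_rng(7)

# 5 hypothetical states. Mortality tracks a STATE-LEVEL confounder
# (say, hepatitis prevalence) that happens to follow average intake.
state_mean_alcohol = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # litres/capita
per_state = 1000

# Individual intake varies around the state mean; individual mortality risk
# depends only on the state confounder plus noise, not on own intake.
alcohol = (np.repeat(state_mean_alcohol, per_state)
           + rng.standard_normal(5 * per_state) * 2.0)
mortality = (np.repeat(state_mean_alcohol, per_state)
             + rng.standard_normal(5 * per_state) * 5.0)

r_individual = np.corrcoef(alcohol, mortality)[0, 1]
r_state = np.corrcoef(state_mean_alcohol,
                      mortality.reshape(5, per_state).mean(axis=1))[0, 1]
print(f"state-level r = {r_state:.2f}, individual-level r = {r_individual:.2f}")
# The ecological correlation is near-perfect; the individual-level one is
# much weaker. Reading the first as the second is the ecological fallacy.
```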


Orthopaedics

The scenario: "BMD (bone mineral density) measured by DEXA correlates with fracture risk (r = -0.45)."

This is the foundation of osteoporosis screening. But:

  • r² = 0.20. BMD explains only 20% of fracture risk variation. 80% is fall risk, bone architecture (not captured by DEXA), neuromuscular function, vitamin D status, medication use.
  • This means many patients with "normal" BMD will fracture (false reassurance), and many patients with "low" BMD will never fracture (unnecessary treatment).
  • FRAX was developed precisely because BMD alone (one correlation) was insufficient. FRAX combines BMD with multiple clinical risk factors using regression — because a single correlation was too weak.

The orthopaedic lesson: A single correlation between a risk factor and an outcome is almost never sufficient for clinical decision-making. You need multiple predictors combined in a regression model. Correlation identifies candidates for the model. Regression builds the model. Clinical trials validate the model.


Radiology / Pathology

The scenario: "CT volumetric measurements of tumour size correlate with pathological tumour size (r = 0.88)."

This seems excellent. But remember: correlation is NOT agreement.

r = 0.88 means CT and pathology MOVE TOGETHER. It does NOT mean CT and pathology AGREE.

CT could systematically overestimate tumour size by 15mm and still have r = 0.88 (strong correlation, systematic bias). For treatment planning that depends on exact tumour size (radiation field margins, surgical margins), this systematic bias matters — even though the correlation is "excellent."

The correct analysis: Bland-Altman plot showing mean difference (bias) and limits of agreement. If mean bias = 15mm with limits of agreement from 5mm to 25mm, CT consistently overestimates. A surgeon who trusts the CT size is cutting a margin 15mm wider than needed.

Bland & Altman (1986) wrote their famous paper precisely because the entire medical literature was using correlation for method comparison when it should have been using agreement analysis. Their paper has been cited over 60,000 times — making it one of the most cited statistics papers in history — because the error was so pervasive.


The 6 Ways Not Knowing Correlation Destroys You

1. You confuse correlation with causation and change practice

Every observational study that reports a correlation is describing an association, not a causal effect. Until an RCT confirms the direction of causation and excludes confounding, a correlation is a hypothesis, not an answer.

Changing clinical practice based on a correlation coefficient is like buying a house based on its photograph — you're acting on incomplete information and may be unpleasantly surprised.

2. You overestimate the strength of r and ignore r²

r = 0.5 sounds "moderate." r² = 0.25 means 75% of the variation is unexplained. If you don't square r, you systematically overestimate how informative the correlation is.

Rule of thumb: Always square r. If r² < 0.25, the correlation is too weak for individual-level prediction, regardless of how "significant" the p-value is.

3. You miss non-linear relationships because r is zero

A perfectly U-shaped or inverted-U relationship gives r approximately 0. If you only compute r and don't plot the data, you'll miss the relationship entirely. Always scatter plot first, then calculate.
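A two-line demonstration (Python with NumPy; the U-shaped data is invented, e.g. think mortality against a centred risk factor):

```python
import numpy as np

# Perfect U-shape: y is completely determined by x, yet r is zero,
# because Pearson's r only detects the LINEAR component.
x = np.arange(-3, 4, dtype=float)   # -3, -2, ..., 3 (centred predictor)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")   # 0.000: a perfect relationship, invisible to r
```

Here y is a deterministic function of x, the strongest relationship possible, and r still reports nothing. Only the scatter plot would have saved you.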

4. You use Pearson's r when you should use Spearman's rho

If either variable is ordinal (pain scores, staging, grades), skewed, or has outliers — Pearson's r is wrong. Use Spearman's rho.

If your data has ONE extreme outlier that creates or destroys the correlation — Pearson's r is misleading. Spearman's rho (based on ranks) is robust.

The thesis examiner's question: "Why did you use Pearson's and not Spearman's?" If your answer is "because SPSS defaulted to it," you've just failed the methods question.
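The outlier problem is easy to show numerically (Python with NumPy; the data is hypothetical, and Spearman's rho is computed here as Pearson's r on the ranks, which is its definition when there are no ties):

```python
import numpy as np

def ranks(v):
    # rank transform (assumes no ties, true for this example)
    out = np.empty(len(v))
    out[np.argsort(v)] = np.arange(1, len(v) + 1)
    return out

# Ten essentially unrelated observations...
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([5, 3, 8, 1, 9, 2, 7, 4, 10, 6], dtype=float)
# ...plus ONE extreme outlier.
x = np.append(x, 100.0)
y = np.append(y, 100.0)

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]  # Spearman = Pearson on ranks
print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
# One leverage point manufactures a near-perfect Pearson r;
# the rank-based rho barely moves.
```

One fabricated point drags Pearson's r above 0.99 while Spearman's rho stays modest, because the outlier is just one more rank.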

5. You use correlation for method comparison

Two lab assays "correlate at r = 0.96." Excellent? Not necessarily. They might correlate beautifully while one reads 30% higher than the other. Correlation measures co-movement, not agreement.

Use Bland-Altman analysis for method comparison. Use ICC for rater agreement. Use correlation only for association.

6. You can't evaluate surrogate endpoints

When FDA debates whether PFS is an acceptable surrogate for OS in a specific cancer type, the core evidence is the correlation between PFS and OS across trials. If you don't understand correlation — its strengths, its limits, the difference between r = 0.5 and r = 0.9 in terms of r² — you can't evaluate whether a surrogate is valid, and you're trusting regulatory decisions you can't critically appraise.


Correlation vs Regression — The Relationship Between the Relatives

Students confuse correlation and regression because they're closely related:

| Feature | Correlation | Regression |
|---|---|---|
| What it measures | Strength and direction of association | The equation connecting X to Y |
| Output | A single number: r (or ρ) | An equation: Y = β₀ + β₁X |
| Direction | Symmetric: r(X,Y) = r(Y,X) | Asymmetric: predicting Y from X is not the same as predicting X from Y |
| Purpose | "Are these related?" | "Can I predict one from the other?" |
| Handles confounders? | No | Yes (multiple regression adjusts for covariates) |
| Handles multiple predictors? | No (only pairwise) | Yes (multiple regression) |
| Mathematical connection | r² = proportion of variance explained by the regression | Slope of the regression line = r × (SDy/SDx) |
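The mathematical connection in the last row can be checked numerically on synthetic data (Python with NumPy; the data is randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic X and Y with a linear relationship plus noise.
x = rng.standard_normal(200)
y = 2.5 * x + rng.standard_normal(200)

r = np.corrcoef(x, y)[0, 1]
slope = np.polyfit(x, y, 1)[0]   # least-squares slope of the regression of Y on X

# Identity from the table: slope = r * (SDy / SDx)
print(np.isclose(slope, r * y.std() / x.std()))   # True
```

This identity is also why the two tools are symmetric vs asymmetric: r is unchanged if you swap X and Y, but the slope is rescaled by the two standard deviations, so the regression of Y on X is not the regression of X on Y.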

The key insight: Correlation is a screening tool — it tells you whether a relationship exists and how strong it is. Regression is the modelling tool — it builds the equation, adjusts for confounders, and makes predictions. Correlation is the first date. Regression is the relationship.

In a thesis: you might report correlations in a preliminary table (Table 2: "Correlations between variables"), then build a regression model with the important ones (Table 3: "Multiple regression analysis"). The correlation identifies candidates. The regression builds the model.


The One Thing to Remember

Correlation is a thermometer. It measures the temperature of the relationship between two variables. It tells you how hot or cold that relationship is.

But a thermometer can't tell you WHY the room is hot. Is it the heater? The sunlight? A fire? Body heat from 50 people?

Correlation measures the heat. Causation identifies the source. Confounding is the insulation that makes the wrong source look responsible. And r² tells you how much of the heat this particular source actually explains.

When you see a correlation coefficient in a paper, ask four questions:

  1. How strong is it really? (Square it. If r² < 0.25, it's weak regardless of p-value.)
  2. Is it linear? (Did they plot it? If not, the relationship could be curved and r = 0.)
  3. Is it confounded? (What third variables could explain the co-movement?)
  4. Is it the right measure? (Pearson vs Spearman? Correlation vs agreement?)

The resident who asks these four questions reads every correlation table in every paper with the right level of scepticism — not dismissing correlations, but not being seduced by them either. That's the difference between using evidence and being used by it.