Why Are Some Tests Named After What They Assume and Others After What They Don't?
The Problem First
You're a dermatology resident. Your thesis compares a new topical cream vs standard treatment for psoriasis severity. You measured PASI scores (0-72 scale) in 22 patients per group.
You open SPSS. You see a menu:
Compare Means → Independent Samples T-Test
Nonparametric Tests → 2 Independent Samples → Mann-Whitney U
Two doors. Same question ("are the groups different?"). You pick one. You get a p-value. You write your thesis.
But WHICH door you pick determines WHETHER your p-value is correct. And the sign above the doors — "parametric" and "nonparametric" — tells you nothing useful in plain English. One sounds like it has parameters. The other sounds like it doesn't. What parameters? Why does having them or not change the maths? Who decided this naming system?
Your statistician friend says: "PASI scores are ordinal and skewed. Use Mann-Whitney."
Your supervisor says: "The t-test is more powerful. Use that."
They're both partially right and completely unhelpful, because neither told you WHY. The answer lies in a split in statistical philosophy, pitting a tradition two centuries old against a 1940s rebellion, and it all traces back to what you're willing to ASSUME about your data.
Word Surgery: "Parametric"
"Parametric"
Root: Greek para- (beside, alongside) + metron (measure) → parametron = "a measure alongside" / "a subsidiary measurement"
Wait — that doesn't obviously connect to what "parametric" means in statistics. Let's trace the detour:
In mathematics (19th century): A "parameter" came to mean a defining constant of a mathematical function. The bell curve (Gaussian distribution) has two parameters: μ (mean) and σ (standard deviation). If you know μ and σ, you know the ENTIRE curve — every percentile, every probability, everything. These two numbers fully DEFINE the shape.
In statistics: A "parametric" method is one that assumes the data comes from a distribution defined by parameters (usually the Gaussian, defined by μ and σ). The test's formulas were derived under this assumption. The p-value is correct ONLY IF this assumption holds.
→ So "parametric" literally = "based on parameters" = "based on the assumption that your data comes from a distribution with known, definable parameters (usually the bell curve)."
→ Aha: A parametric test says: "I believe your data was GENERATED by a bell curve. Give me your mean and SD and I'll tell you everything." If your data WASN'T generated by a bell curve, the test's beliefs are wrong and its answers are unreliable.
"Non-parametric"
Root: Latin non (not) + parametric → "not based on parameters" / "not assuming a specific distribution"
→ "Non-parametric" literally = "I make NO assumptions about which distribution generated your data."
→ Aha: A non-parametric test says: "I don't care what shape your data is. Bell, skewed, bimodal, flat — doesn't matter. I'll work with what you give me."
Why This Naming Is Confusing
Confusion 1: "Non-parametric" sounds like "has no parameters." But non-parametric tests DO produce test statistics, DO have null hypotheses, and DO have parameters they estimate (like median differences). The "non-parametric" part refers to the DISTRIBUTIONAL assumption, not the test itself. The test doesn't assume the DATA has specific distributional parameters. The test ITSELF still has parameters.
Confusion 2: "Parametric" sounds technical and superior. Students hear "parametric" and think "advanced, proper, preferred." They hear "non-parametric" and think "fallback, inferior, last resort." This is backwards for non-normal data — non-parametric tests are BETTER (more accurate, sometimes more powerful) when the parametric assumption is violated.
Confusion 3: The name describes what the test ASSUMES, not what it DOES. Both types of test do the same thing: compare groups. The name doesn't tell you the purpose — it tells you the hidden fine print. It's like naming cars "fuel-assuming" and "non-fuel-assuming" instead of "petrol" and "electric." The name describes the engine's requirements, not the vehicle's purpose.
Alternative Names That Would Have Been Clearer
| Actual Name | What It SHOULD Be Called | Why |
|---|---|---|
| Parametric test | Distribution-assuming test | Assumes a specific distribution |
| Non-parametric test | Distribution-free test | Doesn't assume any distribution |
"Distribution-free" IS an official alternative name for non-parametric tests, from the same 1940s literature in which Wolfowitz coined "non-parametric." It never caught on as widely, which is unfortunate, because "distribution-free" actually explains what the test does.
The History — Two Schools That Couldn't Agree
The Parametric Tradition: Gauss → Gosset → Fisher → Pearson
The parametric tradition began with the assumption that nature follows the bell curve.
Gauss (1809): Measurement errors follow a Gaussian distribution. The mean and variance (two parameters) describe everything.
Gosset / "Student" (1908): Derived the t-distribution for comparing means when the data is Gaussian and the sample is small. His derivation REQUIRES the normality assumption — the maths literally doesn't work otherwise.
Fisher (1920s): Built ANOVA (F-test) and regression on the Gaussian assumption. Every F-distribution table in every textbook assumes normally distributed residuals.
Karl Pearson (1900): Developed the chi-squared test for categorical data — which is parametric in the sense that it assumes a specific distribution (multinomial) but is often confused with non-parametric tests because it doesn't assume normality.
The parametric empire: t-test, paired t-test, ANOVA, repeated-measures ANOVA, Pearson correlation, linear regression, ANCOVA, MMRM. All assume Gaussian data (or Gaussian residuals). All produce wrong p-values when the assumption fails badly.
The Non-Parametric Revolution: Wolfowitz → Wilcoxon → Mann & Whitney → Kruskal & Wallis
The non-parametric school began with a simple observation: most real-world data doesn't follow the bell curve.
Jacob Wolfowitz (1942): Coined the term "non-parametric" in a paper titled "Additive Partition Functions and a Class of Statistical Hypotheses." He defined non-parametric methods as those that make "no assumption about the precise form of the sampled population."
Word Surgery: Why Wolfowitz Chose "Non-Parametric"
Wolfowitz was deliberately defining his methods by what they DON'T do — by their ABSENCE of assumptions. This negative framing was intentional:
The existing parametric methods were defined by their assumptions (Gaussian distribution with parameters μ and σ). Wolfowitz's methods had NO such assumptions. The most natural name was the NEGATION of the existing name: non-parametric.
It's a reactive name, not a constructive one. "Non-parametric" tells you what the method ISN'T, not what it IS. It's like calling vegetarian food "non-meat food." Accurate, but it defines the thing by what's absent.
Frank Wilcoxon (1945): A chemist at American Cyanamid Company, Wilcoxon published "Individual Comparisons by Ranking Methods" — a 3-page paper in Biometrics Bulletin that introduced rank-based tests. The idea was revolutionary in its simplicity:
Instead of using the actual values, convert them to RANKS (1st, 2nd, 3rd...) and do the analysis on the ranks.
Word Surgery: "Rank"
Root: Old French ranc/rang = "a row, a line" (originally from Germanic hring = ring, circle → a row of soldiers standing in a ring)
→ "Rank" = "position in a row when sorted from smallest to largest"
Why ranks fix the distribution problem:
Whatever shape your original data has (Gaussian, skewed, bimodal, shaped like a camel), the ranks are always exactly the integers 1, 2, 3, ..., n. By converting to ranks, you DESTROY the distributional information (which you didn't want to assume anyway) and keep only the ORDERING information (which is truly in the data).
A patient with CRP = 340 mg/L and a patient with CRP = 35 mg/L might have ranks 30 and 15. The extreme value (340) becomes just "rank 30" — its extremity is compressed. This is why non-parametric tests are robust to outliers.
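This compression is easy to see with scipy's `rankdata` (the CRP values here are hypothetical, chosen to include one extreme outlier):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical CRP values in mg/L; 340 is an extreme outlier
crp = np.array([2, 5, 8, 12, 35, 340])

ranks = rankdata(crp)  # ties would get average ranks; none here
print(ranks)           # → [1. 2. 3. 4. 5. 6.]

# 340 is nearly 10x the next value, yet its rank (6) sits only one
# step above the rank of 35 (5): the extremity is compressed away.
```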
Henry Mann and Donald Whitney (1947): Independently developed and formalised the rank-sum test. The Mann-Whitney U test is mathematically equivalent to Wilcoxon's rank-sum test but with a different test statistic formula.
Word Surgery: "Mann-Whitney U"
"Mann-Whitney": Named after the two inventors. Simple eponymous naming.
"U": The letter U is the test statistic. Mann and Whitney defined U as the number of times an observation from one group PRECEDES an observation from the other group in the combined ranking. Why "U"? Most likely simply the next available letter in their notation — no deep meaning.
"Wilcoxon rank-sum" and "Mann-Whitney U" are the SAME TEST with different test statistics (W vs U) that are mathematically interchangeable. The dual naming causes enormous confusion. Students think they're different tests. They're not.
William Kruskal and W. Allen Wallis (1952): Extended the rank approach to k groups (more than two). Their test is the non-parametric equivalent of one-way ANOVA.
Word Surgery: "Kruskal-Wallis"
Eponymous. Named after both inventors. The test statistic is H (sometimes called the Kruskal-Wallis H statistic). Why "H"? Likely after "hypothesis" or simply the next available letter. No confirmed etymology.
The Complete Mapping — Which Test Replaces Which?
This is the table every medical resident needs:
| Research Question | Data Type | Parametric Test | Non-Parametric Alternative | When to Choose Non-Parametric |
|---|---|---|---|---|
| Compare 2 independent groups | Continuous | Independent t-test | Mann-Whitney U (= Wilcoxon rank-sum) | Skewed data, ordinal data, small n, outliers |
| Compare 2 paired/matched groups | Continuous | Paired t-test | Wilcoxon signed-rank | Skewed differences, ordinal data |
| Compare 3+ independent groups | Continuous | One-way ANOVA | Kruskal-Wallis H | Skewed data, unequal variances, ordinal |
| Compare 3+ paired groups | Continuous | Repeated-measures ANOVA | Friedman test | Skewed data, ordinal data |
| Correlation between 2 variables | Continuous | Pearson's r | Spearman's ρ (rho) | Non-linear monotonic, ordinal, outliers |
| Categorical data (2×2 or r×c) | Categorical | Chi-squared (χ²) | Fisher's exact | Expected cell counts < 5, small n |
Word Surgery: "Friedman Test"
Named after Milton Friedman, and yes, it IS the famous economist. Friedman began his career as a statistician and published the test in 1937, long before the work that won him the Nobel Memorial Prize in Economics (1976). His statistical test lives on in medical research.
→ So every time you run a Friedman test, you're using a tool created by a future Nobel-winning economist who started as a statistician.
Word Surgery: "Fisher's Exact Test"
"Fisher's": Ronald Fisher. "Exact": Unlike chi-squared (which uses a large-sample approximation), Fisher's test calculates the EXACT probability under H0 by enumerating all possible table configurations.
Root of "exact": Latin exactus = "driven out, completed, precise" (past participle of exigere = to demand, to weigh precisely)
→ "Fisher's exact test" = "Fisher's precisely-calculated test" — no approximation, no assumption about sample size. The p-value is computed exactly from the hypergeometric distribution.
When to use it: When expected cell counts in a chi-squared test are < 5. The chi-squared test is a large-sample approximation. With small expected counts, the approximation breaks and the p-value is wrong. Fisher's exact test doesn't approximate — it enumerates.
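As an illustration, here is a hypothetical 2×2 table with small counts, run through scipy's `chi2_contingency` and `fisher_exact`:

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table: treatment response in two small groups
#                improved  not improved
# new cream          8          2
# standard           3          7
table = [[8, 2], [3, 7]]

# Chi-squared relies on a large-sample approximation; here the
# smallest expected cell count is below 5, so it is shaky.
chi2, p_chi2, dof, expected = chi2_contingency(table)
print(expected.min())  # → 4.5

# Fisher's exact enumerates all tables with these margins under
# the hypergeometric distribution: no approximation involved.
odds_ratio, p_exact = fisher_exact(table)
print(p_exact)
```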
The Core Mechanism — What Each Type Actually Does Differently
What a Parametric Test Does Inside
The t-test computes:
t = (Mean₁ - Mean₂) / SE_difference
Then looks up this t-value in the t-distribution (which was derived assuming Gaussian data) to get a p-value.
Every step assumes Gaussian:
- The MEAN is the optimal summary → only true for symmetric (Gaussian) data
- The VARIANCE in the SE formula is an efficient estimator → only optimal for Gaussian data
- The t-distribution is the correct reference → only true if the data is Gaussian (or n is large enough for CLT)
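Those steps can be reproduced by hand and checked against `scipy.stats.ttest_ind` (the data are simulated stand-ins for the PASI example, not real scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated stand-ins for PASI scores, 22 patients per arm
a = rng.normal(12, 4, 22)
b = rng.normal(9, 4, 22)

# Student's pooled-variance t statistic, computed manually
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))            # the SE_difference above
t_manual = (a.mean() - b.mean()) / se
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

# scipy's equal-variance t-test should agree exactly
t_scipy, p_scipy = stats.ttest_ind(a, b)
assert np.isclose(t_manual, t_scipy) and np.isclose(p_manual, p_scipy)
```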
What a Non-Parametric Test Does Inside
The Mann-Whitney U computes:
- Combine both groups into one dataset
- Rank all values from 1 to N (lowest to highest)
- Sum the ranks in each group
- Compare the rank sums to what you'd expect if the groups were identical
No step assumes Gaussian:
- Ranking doesn't care about the shape — it only cares about ORDER
- The rank sum distribution under H0 is known EXACTLY for any sample size (calculated by enumeration or approximation)
- No variance formula is needed
- No reference distribution needs to be assumed for the data
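The four steps above can be done by hand and checked against `scipy.stats.mannwhitneyu`; the identity U₁ = R₁ − n₁(n₁+1)/2 links the rank sum to U (simulated skewed data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.exponential(5, 22)  # simulated skewed scores, group A
b = rng.exponential(8, 22)  # simulated skewed scores, group B

# Steps 1-3: pool, rank 1..N, sum the ranks of group A
pooled = np.concatenate([a, b])
ranks = stats.rankdata(pooled)     # average ranks for any ties
r1 = ranks[:len(a)].sum()

# U follows from the rank sum: U1 = R1 - n1(n1+1)/2
n1 = len(a)
u1 = r1 - n1 * (n1 + 1) / 2

# Step 4: scipy compares this to the H0 distribution of rank sums
res = stats.mannwhitneyu(a, b, alternative="two-sided")
assert np.isclose(u1, res.statistic)   # scipy reports U for the first sample
```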
The Tradeoff
| Feature | Parametric | Non-Parametric |
|---|---|---|
| Power when assumptions MET | Higher (uses more information from the data) | Slightly lower (~95% as efficient for Gaussian data) |
| Power when assumptions VIOLATED | Lower (wrong formula → lost power) | Higher (robust to violations) |
| Information used | Raw values (means, variances) | Ranks only (ordering) |
| Sensitivity to outliers | HIGH (outliers distort mean and variance) | LOW (outlier becomes highest rank, not an extreme value) |
| Applicable to ordinal data | NO (ordinal data has no meaningful mean) | YES (ordinal data can be ranked) |
| Assumptions | Normality, homoscedasticity, interval/ratio scale | Independence, ordinal or continuous scale |
The "95% Efficiency" Myth and Truth
Word Surgery: "Asymptotic Relative Efficiency" (ARE)
Root: Asymptotic = Greek a- (not) + syn (with) + ptotos (falling) → "not falling together" → approaching but never reaching a limit. In statistics: "as sample size approaches infinity."
ARE compares the power of two tests as n → ∞. The ARE of the Mann-Whitney U relative to the t-test for Gaussian data is 3/π ≈ 0.955.
This means: for perfectly Gaussian data with infinite sample size, the Mann-Whitney needs about 5% more subjects to achieve the same power as the t-test. The "cost" of not assuming normality is a 5% sample size increase.
But for non-normal data: The ARE of the Mann-Whitney can EXCEED 1.0 — meaning the non-parametric test is MORE powerful than the t-test. For heavily skewed or heavy-tailed distributions, the ARE can be 1.5 or higher. The t-test LOSES power because its formula is mismatched to the data.
→ The "parametric is more powerful" mantra is only true when the parametric assumptions are met. When they're violated — which is most of the time in clinical data — the non-parametric test often wins.
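A small simulation illustrates the point. All settings are arbitrary choices for the sketch: heavy-tailed data from a t distribution with 2 df, a location shift of 1, 30 per group, 500 simulated trials:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power(sampler, pvalue, n=30, shift=1.0, sims=500, alpha=0.05):
    """Fraction of simulated two-group trials in which the test rejects H0."""
    hits = 0
    for _ in range(sims):
        a = sampler(n)
        b = sampler(n) + shift       # a genuine location difference
        hits += pvalue(a, b) < alpha
    return hits / sims

t_p  = lambda a, b: stats.ttest_ind(a, b).pvalue
mw_p = lambda a, b: stats.mannwhitneyu(a, b, alternative="two-sided").pvalue

heavy = lambda n: rng.standard_t(2, n)   # heavy tails: outliers are routine

pow_t, pow_mw = power(heavy, t_p), power(heavy, mw_p)
print(pow_t, pow_mw)   # the Mann-Whitney should reject more often
```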
Why Is the Choice Confusing? — The Three Myths
Myth 1: "Parametric Tests Are Always More Powerful"
Truth: Only when assumptions are met. For skewed data with outliers, non-parametric tests are often more powerful because they aren't distorted by extreme values.
The damage: Residents choose the t-test "because it's more powerful" on skewed data with n=15, get a non-significant result, and conclude "no effect" — when the Mann-Whitney on the same data gives p=0.03. The "more powerful" test was actually LESS powerful for their specific data.
Myth 2: "Non-Parametric Tests Are Only for Small Samples"
Truth: Non-parametric tests work for any sample size. The misconception arose because:
- Small samples are where normality violations matter most (CLT doesn't protect you)
- Small samples are where normality is hardest to check (Shapiro-Wilk has low power)
- So textbooks say "use non-parametric when n is small and normality is doubtful"
Students interpreted this as "non-parametric = small sample method." It's not. It's a "no-distribution-assumption method" that works regardless of n.
Large sample with extreme skewness and outliers? Non-parametric tests are STILL appropriate. The CLT helps with the TEST STATISTIC's distribution but doesn't fix the fact that the MEAN is a poor summary of skewed data.
Myth 3: "You Must Formally Test for Normality Before Choosing"
Truth: Normality tests (Shapiro-Wilk) are helpful but not the sole criterion.
The paradox:
- Small samples (n < 20): Shapiro-Wilk has LOW POWER to detect non-normality. It will say "normal" even for obviously skewed data. You MOST need it here, and it LEAST works.
- Large samples (n > 500): Shapiro-Wilk has HIGH POWER and will reject normality for trivial, inconsequential deviations. It will say "non-normal" even when the data is Gaussian enough for parametric tests.
The practical approach:
- LOOK at the histogram and Q-Q plot — your eyes are better than Shapiro-Wilk for small samples
- Consider the data type — ordinal? bounded? counts? → non-parametric by default
- Consider the clinical context — are outliers expected? is skewness typical for this variable?
- Use Shapiro-Wilk as supplementary evidence, not as a binary decision rule
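In code, the practical approach looks like this (simulated data; `stats.probplot` returns the Q-Q coordinates you would plot, and Shapiro-Wilk serves only as supplementary evidence):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Small skewed sample: Shapiro-Wilk often FAILS to flag non-normality
small_skewed = rng.exponential(5, 15)
w_small, p_small = stats.shapiro(small_skewed)

# Large, near-normal sample: Shapiro-Wilk may flag a deviation
# far too small to matter clinically
large_near_normal = rng.normal(0, 1, 2000) + 0.05 * rng.exponential(1, 2000)
w_large, p_large = stats.shapiro(large_near_normal)

print(p_small, p_large)

# Primary evidence should be visual: probplot gives the Q-Q points
# plus the fitted line (slope, intercept, correlation).
(osm, osr), (slope, intercept, r) = stats.probplot(small_skewed)
```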
The Regulatory Dimension
FDA and the Parametric/Non-Parametric Choice
1. ICH E9: Pre-Specification Required
ICH E9 Section 5.2.2: The statistical analysis plan must pre-specify the primary analysis method, and the guidance calls for sensitivity analyses to demonstrate robustness. In practice, when the primary analysis is parametric (e.g., ANCOVA), a non-parametric sensitivity analysis is typically pre-specified.
The SAP framework:
- Primary analysis: ANCOVA (parametric) adjusting for baseline and stratification factors
- Sensitivity analysis 1: Rank-based ANCOVA (non-parametric)
- Sensitivity analysis 2: MMRM with alternative covariance structure
- Decision rule: If primary and sensitivity analyses are concordant → report primary. If discordant → present both and discuss.
FDA reviewers ALWAYS check concordance. If the parametric primary gives p=0.04 but the non-parametric sensitivity gives p=0.09, the reviewer will question the robustness of the finding. The parametric result may have been driven by outliers or distributional violations.
2. Specific Endpoints That Mandate Non-Parametric
| Endpoint Type | Why Non-Parametric | FDA-Expected Method |
|---|---|---|
| Ordinal scales (mRS 0-6, NYHA class I-IV, pain NRS 0-10) | Not continuous, no meaningful mean, ceiling/floor effects | Wilcoxon rank-sum, proportional odds model, or shift analysis |
| Bounded scales with floor/ceiling effects (VAS 0-100, Barthel Index 0-100) | Clustering at bounds creates severe non-normality | Rank-based methods or responder analysis |
| Time-to-event with non-proportional hazards | Cox model assumes proportional hazards; when violated, HR is misleading | Restricted mean survival time (RMST), piecewise models |
| Count data (seizure counts, exacerbation counts) | Poisson or negative binomial, never Gaussian | Negative binomial regression (parametric but non-Gaussian) |
| Quality of life composites (EQ-5D, SF-36) | Multi-dimensional, often skewed, bounded | Rank-based or responder analysis |
3. The Stroke Trial Paradigm — Where Non-Parametric Won
In acute stroke trials, the primary endpoint is often the modified Rankin Scale (mRS) — a 7-point ordinal scale (0 = no symptoms, 6 = dead).
Early stroke trials used dichotomisation: mRS 0-1 vs 2-6 ("good outcome" vs "bad outcome"). This is a parametric-ish approach (logistic regression on binary outcome). But it THROWS AWAY information — a patient improving from mRS 5 to mRS 3 counts the same as a patient staying at mRS 3.
Modern stroke trials use shift analysis: The Wilcoxon rank-sum test (or proportional odds model) across the ENTIRE ordinal scale. This non-parametric approach uses ALL the information in the scale.
The ECASS III trial (alteplase for stroke in the 3-4.5 hour window) reported outcomes across the full mRS distribution, and ordinal shift analysis has since been adopted as the primary analysis in many modern stroke trials.
The shift to non-parametric methods in stroke trials was a regulatory watershed. FDA and EMA now PREFER ordinal analysis over dichotomisation for mRS. A non-parametric method became the regulatory standard because it was scientifically superior.
4. Bioequivalence — A Parametric Domain
Bioequivalence studies are one area where parametric methods are firmly mandated. AUC and Cmax are log-transformed (making them approximately Gaussian) and analysed by ANOVA.
FDA BE Guidance: "The statistical method for analyzing BE data is based on the two one-sided tests procedure using the log-transformed AUC and Cmax data."
Why parametric works here: The log-transformation converts right-skewed PK data into approximately Gaussian data. The parametric assumption is ENFORCED by transformation. The confidence interval interpretation requires parametric methods.
This is the correct use of parametric methods: when you can TRANSFORM the data to meet the assumption, or when the assumption is reasonably justified.
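A sketch of the two one-sided tests (TOST) principle on log-transformed AUC data. Caveat: real BE analyses use crossover ANOVA per the guidance; this parallel-group version only illustrates the log-transform-then-parametric logic, and every number is hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical log-normal AUC values, 24 subjects per arm
auc_test = rng.lognormal(mean=4.00, sigma=0.20, size=24)
auc_ref  = rng.lognormal(mean=4.05, sigma=0.20, size=24)

log_t, log_r = np.log(auc_test), np.log(auc_ref)
diff = log_t.mean() - log_r.mean()
se = np.sqrt(log_t.var(ddof=1) / 24 + log_r.var(ddof=1) / 24)
df = 24 + 24 - 2

# Two one-sided tests against the 80%-125% limits, on the log scale
lo, hi = np.log(0.80), np.log(1.25)
p_lower = stats.t.sf((diff - lo) / se, df)    # H0: true ratio <= 0.80
p_upper = stats.t.cdf((diff - hi) / se, df)   # H0: true ratio >= 1.25
be_shown = max(p_lower, p_upper) < 0.05       # both H0s must be rejected

gmr = np.exp(diff)   # back-transformed geometric mean ratio
print(round(gmr, 3), be_shown)
```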
Branch-by-Branch — Where the Choice Bites You
General Medicine
The sin: Comparing length of stay (LOS) between two treatment arms using a t-test.
LOS is ALWAYS right-skewed. Most patients: 3-5 days. A few: 20-40 days (ICU transfers, complications). The mean is dragged right by the long-stayers. The SD is inflated. The t-test's SE is wrong. The p-value is unreliable.
The fix: Mann-Whitney U test. Report median (IQR), not mean ± SD.
Real-world consequences: A hospital quality committee uses mean LOS from a t-test analysis to evaluate two care pathways. Pathway A: mean LOS 6.2 days (dragged up by 3 patients who stayed 30+ days). Pathway B: mean LOS 5.1 days (no extreme stays). The t-test says p=0.08, "no significant difference."
Mann-Whitney on the same data: p=0.02. Pathway B is significantly better for the TYPICAL patient. The parametric test was fooled by 3 outliers. The non-parametric test was not. Policy decision reversed.
Surgery
The sin: Using paired t-test to compare pre- and post-operative functional scores.
Pre-operative scores: roughly Gaussian (wide range of disability). Post-operative scores: heavily left-skewed (most patients improve to near-normal, ceiling effect at max score).
The paired t-test analyses the DIFFERENCES. If differences are Gaussian → paired t-test is valid. But post-op ceiling effects create non-Gaussian differences (many patients with near-maximum improvement, a few with minimal improvement).
The fix: Wilcoxon signed-rank test on the paired differences. Or check the distribution of differences first — if they're Gaussian, the paired t-test is fine.
The trap: A surgeon publishes "significant improvement in WOMAC scores (paired t-test, p=0.001)." Reviewer asks: "Was the distribution of score changes normally distributed? Given the ceiling effects in WOMAC, a Wilcoxon signed-rank test should be reported at minimum as a sensitivity analysis." Paper sent back for revision.
Paediatrics
The sin: Using ANOVA to compare APGAR scores across three delivery methods.
APGAR is a 0-10 ordinal scale. It is NOT continuous. The "distance" between APGAR 2 and 3 is not the same as between 7 and 8 (clinically, the difference between 2 and 3 is far more consequential). It has severe ceiling effects in term deliveries (most babies score 8-10).
ANOVA treats APGAR as continuous and Gaussian. It isn't.
The fix: Kruskal-Wallis H test. Or even better, ordinal logistic regression if you want to adjust for confounders.
Obstetrics
The sin: Pearson correlation between maternal weight gain and neonatal birth weight.
Maternal weight gain has a long right tail (some women gain 30+ kg with GDM). Neonatal birth weight in a mixed population (preterm + term) is bimodal.
Pearson's r assumes both variables are continuous, roughly Gaussian, and linearly related. None of these may hold.
The fix: Spearman's ρ (rho). It captures monotonic relationships (as weight gain goes up, birth weight tends to go up) without assuming linearity or normality.
The difference matters: If the relationship is monotonic but curved (accelerating weight gain → accelerating birth weight at higher ranges), Pearson's r underestimates the association. Spearman's ρ captures it fully because it works on ranks, which linearise any monotonic curve.
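A minimal demonstration with a deliberately curved, monotonic relationship (x cubed; purely illustrative numbers):

```python
import numpy as np
from scipy import stats

# Monotonic but accelerating relationship: y grows as the cube of x
x = np.arange(1.0, 21.0)
y = x ** 3

r_pearson, _ = stats.pearsonr(x, y)
rho_spearman, _ = stats.spearmanr(x, y)

print(round(r_pearson, 3), rho_spearman)
# Spearman's rho is 1.0 (up to floating point): ranks linearise any
# monotonic curve. Pearson's r stays below 1 because y vs x is no line.
```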
Psychiatry
The sin: Using repeated-measures ANOVA on HAM-D scores measured at weeks 0, 2, 4, 8, 12.
HAM-D is bounded (0-52), ordinal, and the distribution changes shape over time. At baseline: roughly symmetric around 22-28. At endpoint in responders: right-skewed as scores approach 0 (floor effect).
Repeated-measures ANOVA assumes normality at every timepoint AND sphericity (equal variances of differences between all timepoint pairs). Both assumptions fail.
The fix: MMRM (still parametric but more robust to non-normality) with sensitivity analysis using non-parametric methods (Wilcoxon at each timepoint with multiplicity adjustment, or rank-based ANCOVA).
What FDA actually wants: MMRM as primary (ICH E9 compliance), with non-parametric sensitivity. If the HAM-D data is severely non-normal, FDA reviewers may give MORE weight to the non-parametric sensitivity than the parametric primary.
Community Medicine / PSM
The sin: Using t-tests to compare per-capita healthcare expenditure between intervention and control districts.
Healthcare expenditure is the TEXTBOOK example of extreme right-skew. Most families: ₹0-5,000/year. A few catastrophic cases: ₹2-5 lakh (cancer, cardiac surgery, NICU). The distribution has a massive right tail.
Mean expenditure is meaningless. Mean ± SD is absurd (the SD exceeds the mean, so "mean minus SD" implies negative expenditure). The t-test is comparing numbers that describe nobody.
The fix: Mann-Whitney U (strictly, it tests whether one group tends to produce larger values rather than comparing medians directly, but it is the standard choice here). Or two-part models: logistic regression for any expenditure (yes/no) + gamma regression for the amount (conditional on having expenditure).
Policy impact: A government evaluates a health insurance scheme. Mean expenditure analysis (t-test): "No significant difference in spending." Median analysis (Mann-Whitney): "Significant reduction in out-of-pocket spending for the bottom 80%." The mean was dragged by a few catastrophic cases unaffected by the scheme. The median showed the scheme worked for most people. The wrong test → the wrong policy → millions of beneficiaries affected.
Orthopaedics
The sin: Using Pearson correlation to assess inter-rater agreement of Cobb angle measurements.
Two problems:
- Pearson's r measures ASSOCIATION, not AGREEMENT (a systematic bias of +10° still gives high r)
- Cobb angle measurements in scoliosis populations are often skewed (most mild cases, few severe)
The fix: Spearman's ρ for correlation (robust to outliers). ICC (Intraclass Correlation Coefficient) for agreement. Bland-Altman for visual assessment of bias and limits of agreement.
Radiology / Pathology
The sin: Comparing tumour volumes between responders and non-responders using a t-test.
Tumour volumes are right-skewed and approximately log-normal, essentially without exception. A few massive tumours dominate the mean and variance. The t-test on raw volumes is meaningless.
The fix:
- Option A: Log-transform volumes, check normality, use t-test on log-transformed data
- Option B: Mann-Whitney U on raw volumes
- Option C: Compare geometric means (which is the t-test on log-transformed data, reported back-transformed)
FDA oncology reviewers expect log-transformation or non-parametric methods for tumour size endpoints. Raw-scale parametric analysis of tumour volumes would be flagged in regulatory review.
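The three options side by side on simulated log-normal volumes (all values hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical tumour volumes (mL), log-normal by construction
responders     = rng.lognormal(mean=2.0, sigma=0.8, size=25)
non_responders = rng.lognormal(mean=2.6, sigma=0.8, size=25)

# Options A and C: t-test on log volumes; back-transform the mean
# difference to a geometric mean ratio for reporting
t_log = stats.ttest_ind(np.log(responders), np.log(non_responders))
gmr = np.exp(np.log(responders).mean() - np.log(non_responders).mean())

# Option B: Mann-Whitney U directly on the raw volumes
mw = stats.mannwhitneyu(responders, non_responders, alternative="two-sided")

print(round(gmr, 2), t_log.pvalue, mw.pvalue)
```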
The Decision Flowchart — How to Choose
```
Step 1: What TYPE is your data?
│
├── CATEGORICAL (yes/no, disease/no disease, mild/moderate/severe)
│   └── Chi-squared or Fisher's exact → NOT a parametric vs non-parametric issue
│
├── ORDINAL (pain score 0-10, mRS 0-6, NYHA I-IV, Likert scales)
│   └── ALWAYS non-parametric (Mann-Whitney, Kruskal-Wallis, Spearman's ρ)
│       └── Ordinal data has no meaningful mean → parametric tests are INVALID
│
└── CONTINUOUS (BP, HbA1c, weight, serum creatinine)
    │
    Step 2: Check distribution (histogram + Q-Q plot + Shapiro-Wilk)
    │
    ├── NORMAL (symmetric bell, Shapiro-Wilk p > 0.05, Q-Q points on line)
    │   └── Parametric tests (t-test, ANOVA, Pearson's r)
    │       └── Report: mean ± SD
    │
    ├── NON-NORMAL but TRANSFORMABLE (right-skewed → log-transform fixes it)
    │   └── Transform → check normality → if normal: parametric on transformed data
    │       └── Report: geometric mean (95% CI) or back-transformed values
    │
    └── NON-NORMAL and NOT TRANSFORMABLE (bimodal, ceiling/floor, extreme skew)
        └── Non-parametric tests (Mann-Whitney, Kruskal-Wallis, Spearman's ρ)
            └── Report: median (IQR)
```
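The same decision logic as a small helper function (a simplification with made-up labels; it cannot replace actually looking at the histogram and Q-Q plot):

```python
def choose_test(data_type: str, normal: bool = False,
                transformable: bool = False) -> str:
    """Mirror of the flowchart: data type first, distribution second."""
    if data_type == "categorical":
        return "chi-squared / Fisher's exact"
    if data_type == "ordinal":
        return "non-parametric (Mann-Whitney / Kruskal-Wallis / Spearman)"
    # continuous data: the distribution decides
    if normal:
        return "parametric (t-test / ANOVA / Pearson); report mean ± SD"
    if transformable:
        return "parametric on transformed data; report geometric mean (95% CI)"
    return "non-parametric; report median (IQR)"

print(choose_test("ordinal"))
print(choose_test("continuous", normal=True))
```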
The examiner's shortcut question: "Is your data continuous AND normally distributed?" Yes → parametric. Everything else → non-parametric (or transform-then-parametric).
The 7 Ways Not Knowing This Destroys You
1. You use a t-test on ordinal data
Pain scores (0-10), satisfaction scales (1-5), mRS (0-6) — these are ORDINAL. The intervals between numbers are NOT equal. The "distance" between pain 2 and pain 3 is not the same as between pain 7 and pain 8. Computing a mean of ordinal data is mathematically meaningless. Running a t-test on it is comparing meaningless numbers with a formula that assumes they're meaningful.
2. You lose power by using the "more powerful" test on the wrong data
Paradoxically, choosing the t-test "because it's more powerful" on skewed data with outliers gives you LESS power. The outliers inflate the variance, inflate the SE, shrink the t-statistic, and inflate the p-value. The Mann-Whitney, immune to outliers through ranking, often has HIGHER power in exactly these situations.
3. Your examiner asks THE question and you can't answer it
"Why did you use this test? Did you check the assumptions?"
This is the single most common thesis viva question in biostatistics. If your answer is "because my statistician/supervisor told me to," you fail the methods question. The correct answer demonstrates that you checked normality, considered the data type, and chose the test accordingly.
4. Your paper gets sent back in peer review
Reviewers in good journals check the match between data type and statistical test. "The authors used a t-test on pain scores (ordinal data). Please revise using an appropriate non-parametric test." This delays publication by months and signals methodological weakness to the editor.
5. You misinterpret non-significant results from mismatched tests
A t-test on skewed data gives p=0.08 → "not significant" → "no difference." Mann-Whitney on the same data gives p=0.02 → "significant" → "there IS a difference." The conclusion flips based on test choice. If you don't understand why, you'll either miss a real finding or report a false one.
6. You can't read modern stroke, oncology, or PRO trial reports
Shift analysis (ordinal), rank-based ANCOVA, proportional odds models, responder analysis, restricted mean survival time — the cutting edge of clinical trial methodology is moving AWAY from simple parametric tests and TOWARD methods that respect the data's true nature. A resident who only knows "t-test and ANOVA" cannot read a modern trial report.
7. You can't evaluate whether a published study's conclusions are trustworthy
When a paper reports results, the validity of the p-value depends entirely on whether the chosen test matched the data's distribution and type. If you can't evaluate this match, you can't evaluate the paper. You're trusting conclusions that may be built on the wrong statistical foundation.
The One Thing to Remember
Parametric tests assume your data was born from a bell curve. Non-parametric tests don't care where your data was born.
The choice is not "advanced vs basic" or "powerful vs weak." It's "does your data meet the assumptions or doesn't it?" Use the test that matches your data's REALITY, not the test that matches your WISH for what the data should look like.
"Parametric" means "based on parameters of an assumed distribution." "Non-parametric" means "distribution-free." Wolfowitz named them in 1942. Wilcoxon gave us the rank-based tools in 1945. Mann, Whitney, Kruskal, Wallis, and Friedman completed the toolkit by 1952.
These tools exist because Galton's Victorian dream of universal normality was wrong. Most clinical data is skewed, bounded, ordinal, or contaminated by outliers. The non-parametric revolution gave us tests that work on data as it IS, not as we WISH it were.
The resident who matches the test to the data produces reliable results. The resident who defaults to the t-test on everything produces results that are sometimes right, sometimes wrong, and never verifiably either — which is the worst possible outcome in science.
Check the shape. Check the type. Choose the test. In that order. Always.