P-values quantify the strength of evidence against a null hypothesis: specifically, the probability of observing a test statistic at least as extreme as yours if the null hypothesis were true. Despite being among the most widely used concepts in scientific research, p-values are also among the most misunderstood and misused, and that misuse has contributed to the "replication crisis" in psychology, medicine, and the social sciences. The sections below cover what a p-value actually measures (and what it does not), how over-reliance on p-value thresholds has driven p-hacking and questionable research practices, and why reporting effect sizes alongside p-values is essential for honest interpretation of statistical results.
What a P-Value Is (and Is Not)
A p-value is the probability of seeing data as extreme as yours (or more extreme) if the null hypothesis were true. It is emphatically NOT the probability that the null hypothesis is true, and this distinction is one of the most important conceptual points in applied statistics. A p-value of 0.03 means: if H₀ were true, there is only a 3% chance of seeing a result this extreme or more extreme. It says nothing directly about whether H₀ is actually correct or false — that would require Bayesian analysis incorporating prior probabilities.
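To make the definition concrete, here is a minimal sketch that computes an exact two-sided p-value for a coin-flip experiment. The scenario (15 heads in 20 flips, with H₀ being a fair coin) and the function name are illustrative assumptions, not part of the text above.

```python
from math import comb


def two_sided_binomial_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value under H0: the coin is fair (P(heads) = 0.5)."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Count every outcome at least as far from the expected count as the
    # observed one; under H0 each sequence of flips is equally likely.
    extreme_ways = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )
    return extreme_ways / 2 ** flips


# Hypothetical experiment: 15 heads in 20 flips.
p = two_sided_binomial_p(heads=15, flips=20)
print(f"p = {p:.4f}")  # p = 0.0414
```

The result says that a fair coin would produce a split at least this lopsided about 4% of the time. It does not say there is a 4% chance the coin is fair.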
Common misinterpretations to avoid: "p = 0.03 means there's a 3% chance H₀ is true" (wrong: this is the inverse probability P(H₀|data), which requires Bayes' theorem and a prior). "p = 0.03 means there's a 97% chance H₁ is true" (also wrong, for the same reason). "p > 0.05 means H₀ is probably true" (wrong: absence of evidence is not evidence of absence, especially with small samples that have little power to detect an effect). The correct interpretation is narrow: a p-value is a statement about how unusual your data would be if H₀ were true. Use p-values to quantify surprise under a null model, not to make probability statements about the hypotheses themselves.
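The gap between a p-value and P(H₀|data) can be illustrated with Bayes' theorem. The sketch below is a simplification: it conditions on "the test came out significant at level alpha" rather than on an exact p-value, and the prior probabilities and the 80% power figure are assumed values chosen only for illustration.

```python
def prob_null_given_significant(prior_null: float,
                                alpha: float = 0.05,
                                power: float = 0.8) -> float:
    """P(H0 is true | result is significant), via Bayes' theorem.

    prior_null: assumed share of tested hypotheses where H0 is true.
    alpha:      significance threshold, i.e. P(significant | H0 true).
    power:      assumed P(significant | H0 false).
    """
    false_positives = alpha * prior_null
    true_positives = power * (1 - prior_null)
    return false_positives / (false_positives + true_positives)


# Same threshold, different priors, very different posterior probabilities.
for prior_null in (0.5, 0.9):
    posterior = prob_null_given_significant(prior_null)
    print(f"prior P(H0) = {prior_null:.1f} -> P(H0 | significant) = {posterior:.2f}")
```

Under these assumptions, a field in which 90% of tested hypotheses are null would see roughly a third of its significant results come from true nulls, even though every one of them has p < 0.05.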
The Replication Crisis & P-Hacking
Over-reliance on the p < 0.05 threshold has contributed significantly to irreproducible research in psychology, medicine, and the social sciences, a problem that has been widely documented since the replication crisis came to prominence in the 2010s. "P-hacking" is the practice of running multiple analyses, testing various subgroups, trying different outcome measures, or excluding certain data points until something crosses the 0.05 threshold. Each extra attempt is another chance for noise to look significant, so the false-positive rate climbs well above the nominal 5%, and many published "significant" findings turn out to be statistical noise that does not replicate in follow-up studies.
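The inflation is easy to reproduce in a simulation. The sketch below assumes a deliberately simple setup: every study has no real effect, each outcome measure is an independent sample of 30 standard-normal values, and a z-test with known variance is used. A "p-hacked" study reports whichever of 10 outcomes gives the smallest p-value. All names and parameters are hypothetical.

```python
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def z_test_p(sample: list[float]) -> float:
    """Two-sided p-value for H0: mean = 0, assuming a known SD of 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def false_positive_rate(outcomes_tried: int, studies: int = 2000,
                        n: int = 30, alpha: float = 0.05,
                        seed: int = 0) -> float:
    """Share of null studies reporting p < alpha after trying several outcomes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        # H0 is true for every outcome: the data are pure noise.
        p_values = [
            z_test_p([rng.gauss(0, 1) for _ in range(n)])
            for _ in range(outcomes_tried)
        ]
        # The p-hacker keeps only the best-looking result.
        if min(p_values) < alpha:
            hits += 1
    return hits / studies


honest = false_positive_rate(outcomes_tried=1)
hacked = false_positive_rate(outcomes_tried=10)
print(f"one pre-specified outcome: {honest:.1%} false positives")
print(f"best of ten outcomes:      {hacked:.1%} false positives")
```

With independent outcomes the expected rate for ten attempts is 1 - 0.95^10, about 40%. Real outcome measures are usually correlated, which reduces the inflation somewhat but does not remove it.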
Modern best practices include pre-registering hypotheses and analysis plans before collecting data (which commits researchers to specific tests rather than fishing afterward), reporting effect sizes alongside p-values so readers can judge practical as well as statistical significance, using confidence intervals to communicate how uncertain an estimate is, adopting Bayesian methods where it is appropriate to incorporate prior information explicitly, and treating p = 0.05 as a rough guideline rather than a hard binary threshold. Some researchers have proposed stricter thresholds (such as p < 0.005), and some journals encourage or require pre-registration for publication. Tools such as the Open Science Framework support pre-registration and registered reports, both of which are intended to improve reproducibility in the fields that adopt them.
Effect Size Is As Important As P-Value
With large enough samples, nearly any nonzero difference becomes statistically significant, because p-values shrink as sample size grows even when the underlying effect is tiny. A p-value of 0.0001 for a Cohen's d of 0.05 suggests an effect that is probably real but trivially small, and usually not worth acting on despite being "highly significant." Big-data studies with millions of observations routinely produce p < 0.001 for effects that are real but clinically or practically meaningless. Statistical significance is not the same as practical significance.
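The dependence on sample size can be seen by holding the effect fixed and varying only n. This sketch uses a large-sample normal approximation for a two-group comparison with equal group sizes, where the test statistic is roughly d·√(n/2). The function name and the sample sizes are illustrative assumptions.

```python
from math import erfc, sqrt


def approx_two_sample_p(cohens_d: float, n_per_group: int) -> float:
    """Approximate two-sided p-value for a two-group mean comparison.

    Assumes equal group sizes and a normal approximation, so the
    test statistic is z = d * sqrt(n / 2).
    """
    z = abs(cohens_d) * sqrt(n_per_group / 2)
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) and stays accurate for large z.
    return erfc(z / sqrt(2))


# The effect size never changes; only the sample size does.
for n in (100, 10_000, 1_000_000):
    p = approx_two_sample_p(cohens_d=0.05, n_per_group=n)
    print(f"d = 0.05, n per group = {n:>9,}: p = {p:.2e}")
```

With 100 observations per group the effect is nowhere near significant, while with 10,000 per group the same d = 0.05 already falls below p = 0.001.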
Always pair p-values with effect size estimates (Cohen's d for mean differences, r² for correlation strength, η² or partial η² for ANOVA effect sizes) and confidence intervals for a complete statistical picture. Cohen's conventional benchmarks for d are 0.2 = small effect, 0.5 = medium, 0.8 = large. A reasonable rule of thumb: if Cohen's d < 0.2, the effect is often too small to matter in real-world decisions regardless of the p-value, although the final judgment depends on context, since a small effect applied across a very large population can still add up. Asking "is it practically meaningful?" alongside "is it statistically significant?" produces much better scientific and business decisions than p-value thresholds alone. Guidelines from the APA, AMA, and other professional bodies recommend or require reporting effect sizes in addition to p-values in published research.
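As a minimal sketch of the effect-size side of the report, the function below computes Cohen's d with a pooled standard deviation and attaches the conventional labels listed above. The "negligible" label for |d| < 0.2 and the two sample lists are assumptions made for illustration, not real data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


def effect_label(d: float) -> str:
    """Map |d| onto the conventional benchmarks (0.2 / 0.5 / 0.8)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"  # label assumed here for effects below "small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Made-up measurements for two groups.
treatment = [2.0, 4.0, 6.0, 8.0]
control = [1.0, 3.0, 5.0, 7.0]
d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f} ({effect_label(d)})")  # Cohen's d = 0.39 (small)
```

Reporting d and its label next to the p-value makes it harder for a tiny effect to pass as an important one just because the sample was large.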