A p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one you calculated, assuming the null hypothesis is true. It does NOT give the probability that the null hypothesis is true.

What is the difference between one-tailed and two-tailed tests?

A two-tailed test checks for a difference in either direction (H₁: μ ≠ μ₀). A one-tailed test checks in a specific direction — right-tailed (H₁: μ > μ₀) or left-tailed (H₁: μ < μ₀). Two-tailed is more conservative and most common in practice.

What does statistical power mean?

Statistical power (1−β) is the probability of correctly detecting a real effect when one exists. Power of 0.80 means an 80% chance of rejecting a false null hypothesis. Underpowered studies miss real effects. A power of 0.80 is the conventional minimum target when planning a study.

P-Value Calculator — Calculover

P-Value Range	Evidence Against H₀	Action
p > 0.10	Little or none	Fail to reject H₀
0.05 < p ≤ 0.10	Weak / marginal	Inconclusive — gather more data
0.01 < p ≤ 0.05	Moderate	Reject H₀ at 5% level
0.001 < p ≤ 0.01	Strong	Reject H₀ at 1% level
p ≤ 0.001	Very strong	Reject H₀ at 0.1% level
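
The decision guide above can be expressed as a small lookup. This is a minimal sketch, not the calculator's actual code; the function name `evidence_against_null` is hypothetical, and the labels are copied from the table.

```python
def evidence_against_null(p: float) -> tuple[str, str]:
    """Map a p-value to the (evidence, action) labels from the decision guide.

    Hypothetical helper for illustration; the bands follow the table above.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p <= 0.001:
        return ("Very strong", "Reject H0 at 0.1% level")
    if p <= 0.01:
        return ("Strong", "Reject H0 at 1% level")
    if p <= 0.05:
        return ("Moderate", "Reject H0 at 5% level")
    if p <= 0.10:
        return ("Weak / marginal", "Inconclusive, gather more data")
    return ("Little or none", "Fail to reject H0")


if __name__ == "__main__":
    for p in (0.2, 0.0668, 0.0316, 0.004, 0.0005):
        print(p, evidence_against_null(p))
```

The bands are conventions rather than hard rules, so a result that falls just on one side of a boundary deserves the same caution as one just on the other side.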

1. Choose Your Test Type

Select one-sample Z, two-sample Z, or T-test and enter the test statistic (plus degrees of freedom for T-tests).

2. Set Tail Direction & α

Choose left-tailed, right-tailed, or two-tailed matching your alternative hypothesis and pick a significance level (0.05 is standard).

3. Read the Results

The calculator shows the p-value, decision (reject/fail to reject H₀), effect size, and a live bell curve with the shaded p-value region.

Right-Tailed P-Value

p = P(Z ≥ z) = 1 − Φ(z)

The area in the right tail beyond the test statistic. Use when H₁: μ > μ₀.

Left-Tailed P-Value

p = P(Z ≤ z) = Φ(z)

The area in the left tail up to the test statistic. Use when H₁: μ < μ₀.

Two-Tailed P-Value

p = 2 × P(Z ≥ |z|) = 2 × (1 − Φ(|z|))

Twice the one-tailed probability; tests for any difference from H₀ regardless of direction. Most common in practice.
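
The three formulas above can be sketched in a few lines of Python using only the standard library. The names `phi` and `p_value_from_z` are illustrative assumptions, not part of the calculator; Φ is computed from the complementary error function, and the right tail uses the symmetry 1 − Φ(z) = Φ(−z).

```python
import math


def phi(z: float) -> float:
    """Standard normal CDF, written via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def p_value_from_z(z: float, tail: str = "two") -> float:
    """P-value for a z statistic; tail is 'left', 'right', or 'two'."""
    if tail == "right":
        return phi(-z)              # equals 1 - phi(z) by symmetry
    if tail == "left":
        return phi(z)
    if tail == "two":
        return 2.0 * phi(-abs(z))   # equals 2 * (1 - phi(|z|))
    raise ValueError("tail must be 'left', 'right', or 'two'")


if __name__ == "__main__":
    for tail in ("left", "right", "two"):
        print(tail, round(p_value_from_z(1.96, tail), 4))
```

This covers Z-tests only. A T-test needs the t-distribution CDF with the appropriate degrees of freedom, which is not in the Python standard library.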

P-Value: The probability of obtaining a test statistic at least as extreme as the one observed, given that H₀ is true.

Null Hypothesis (H₀): The default assumption of no effect or no difference in the population.

Alternative Hypothesis (H₁): The claim being tested; asserts that an effect or difference exists.

Significance Level (α): The threshold below which p leads to rejection of H₀; commonly 0.05.

Type I Error: Rejecting a true null hypothesis (false positive). Probability = α.

Type II Error (β): Failing to reject a false null hypothesis (false negative). Probability = 1 − power.

Cohen's d: Standardised effect size: |mean difference| / pooled SD. Small≈0.2, medium≈0.5, large≈0.8.

Statistical Power: Probability of correctly rejecting a false H₀ = 1 − β. Target ≥ 0.80.

Does a new drug lower blood pressure?

Result: p = 0.0316 → Significant. Reject H₀. d ≈ 0.30 (small-medium effect).

Does version B of a landing page convert better?

Result: p = 0.0668 → Not significant. Fail to reject H₀. Need more data.

Does a mindfulness intervention reduce anxiety? (T-test)

Result: p ≈ 0.017 → Significant. Reject H₀. Exact t-distribution used.

P-values quantify the strength of evidence against a null hypothesis — specifically, the probability of observing a test statistic as extreme as yours (or more) if the null hypothesis were true. Despite being one of the most-used statistical concepts in scientific research, p-values are also one of the most misunderstood and misused numbers in all of statistics, contributing to the "replication crisis" in psychology, medicine, and social sciences. The sections below cover what a p-value actually measures (and what it does not), how over-reliance on p-value thresholds has driven p-hacking and questionable research practices, and why reporting effect size alongside p-values is essential for honest interpretation of statistical results.

What a P-Value Is (and Is Not)

A p-value is the probability of seeing data as extreme as yours (or more extreme) if the null hypothesis were true. It is emphatically NOT the probability that the null hypothesis is true, and this distinction is one of the most important conceptual points in applied statistics. A p-value of 0.03 means: if H₀ were true, there is only a 3% chance of seeing a result this extreme or more extreme. It says nothing directly about whether H₀ is actually correct or false — that would require Bayesian analysis incorporating prior probabilities.

Common misinterpretations to avoid: "p = 0.03 means there's a 3% chance H₀ is true" (wrong, this is the inverse probability P(H₀|data) which requires Bayes' theorem). "p = 0.03 means there's a 97% chance H₁ is true" (also wrong). "p > 0.05 means H₀ is probably true" (wrong — absence of evidence is not evidence of absence, especially with small samples). The correct interpretation is narrow: p-value is a statement about how unusual your data would be if H₀ were true. Use p-values to quantify surprise under a null model, not to make probability statements about the hypotheses themselves.
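
To see why a p-value cannot be read as P(H₀|data), it can help to work through Bayes' theorem with made-up inputs. The sketch below is an illustration under stated assumptions: it considers the coarser event "the test came out significant at level α" rather than one specific p-value, and the prior probability of H₀ and the power are assumed inputs that a real analysis would have to justify. The function name is hypothetical.

```python
def prob_null_given_significant(prior_null: float,
                                alpha: float = 0.05,
                                power: float = 0.80) -> float:
    """P(H0 is true | test is significant), by Bayes' theorem.

    prior_null: assumed prior probability that H0 is true.
    alpha:      P(significant | H0 true), the Type I error rate.
    power:      P(significant | H0 false).
    """
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    for prior in (0.5, 0.9):
        print(prior, round(prob_null_given_significant(prior), 3))
```

With the same α and power, the answer moves from about 6% to 36% as the assumed prior on H₀ goes from 0.5 to 0.9, which is why the p-value alone cannot supply this probability.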

The Replication Crisis & P-Hacking

Over-reliance on the p < 0.05 threshold has contributed significantly to irreproducible research in psychology, medicine, and social sciences — a problem widely documented since the 2010s replication crisis. "P-hacking" is the practice of running multiple analyses, testing various subgroups, trying different outcome measures, or excluding certain data points until something crosses the 0.05 threshold. This inflates false-positive rates well above the nominal 5%, meaning many published "significant" findings are statistical noise that won't replicate in follow-up studies.
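
The inflation described above can be quantified in the simplest case. The sketch below assumes k independent analyses, each run at level α with every null hypothesis true; real p-hacked analyses are usually correlated, so treat the numbers as a rough guide. The function name is an illustrative assumption.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Chance of at least one false positive across k independent tests,
    assuming every null hypothesis is true: 1 - (1 - alpha)^k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, round(familywise_error_rate(0.05, k), 3))
```

Trying ten independent analyses and reporting whichever one crosses 0.05 already pushes the false-positive rate to roughly 40%.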

Modern best practices include pre-registering hypotheses and analysis plans before collecting data (committing researchers to specific tests rather than fishing afterward), reporting effect sizes alongside p-values so readers can judge practical as well as statistical significance, using confidence intervals to communicate the size of the uncertainty, adopting Bayesian methods where explicit priors are appropriate, and treating p = 0.05 as a rough guideline rather than a hard binary threshold. Some researchers have proposed a stricter default threshold (p < 0.005), and some journals require or encourage pre-registration. Tools such as the Open Science Framework support pre-registration and registered reports, which are intended to improve reproducibility.

Effect Size Is As Important As P-Value

With large enough samples, nearly any nonzero difference becomes statistically significant: for a fixed effect, the test statistic grows with the square root of the sample size, so the p-value shrinks even when the underlying effect is tiny. A p-value of 0.0001 paired with a Cohen's d of 0.05 points to an effect that is probably real but trivially small, and rarely worth acting on despite being "highly significant." Studies with millions of observations routinely produce p < 0.001 for effects that are clinically or practically meaningless. Statistical significance is not the same as practical significance.
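
The effect of sample size on the p-value can be sketched numerically. The example below assumes a two-sample Z-test with equal group sizes and an observed standardized difference exactly equal to d, so that z = d·√(n/2); the function name is hypothetical and the setup is deliberately idealized.

```python
import math


def two_tailed_p_for_effect(d: float, n_per_group: int) -> float:
    """Two-tailed p-value when the observed standardized difference is d
    and each of two groups has n_per_group observations (Z-test sketch)."""
    z = d * math.sqrt(n_per_group / 2.0)
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        print(n, two_tailed_p_for_effect(0.05, n))
```

The effect size stays at d = 0.05 throughout; only the sample size changes, yet the p-value moves from clearly non-significant to far below 0.001.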

Always pair p-values with effect size estimates (Cohen's d for mean differences, r² for correlation strength, η² or partial η² for ANOVA) and confidence intervals for a complete statistical picture. Cohen's d interpretation: 0.2 = small effect, 0.5 = medium, 0.8 = large. A reasonable rule of thumb: if Cohen's d < 0.2, the effect is often too small to matter in individual real-world decisions regardless of p-value, although small effects can still add up across large populations. Asking "is it practically meaningful?" alongside "is it statistically significant?" produces much better scientific and business decisions than p-value thresholds alone. Reporting guidelines from professional bodies such as the APA call for effect sizes to be reported in addition to p-values in published research.

What is a p-value?

A p-value is the probability of observing a test statistic as extreme as yours (or more extreme), given that the null hypothesis is true. It does NOT tell you the probability that the null hypothesis is correct — that's a common and important misconception.

When should I use a one-tailed vs two-tailed test?

Use a two-tailed test when you're testing for any difference regardless of direction (H₁: μ ≠ μ₀) — this is the most common and conservative choice. Use a one-tailed test only when you have a strong pre-specified directional hypothesis (e.g. H₁: μ > μ₀) before seeing the data.

What is the difference between a Z-test and a T-test?

A Z-test applies when the population standard deviation is known; it is also used as an approximation when the sample is large (n > 30 is a common rule of thumb), because the t-distribution then closely matches the normal. A T-test is used when the population SD is unknown and must be estimated from the data, which is nearly always the case in practice.

Can p > 0.05 prove the null hypothesis?

No. A large p-value means you lack sufficient evidence to reject H₀ — not that H₀ is proven true. The study may have been underpowered (too small n), or the effect may be real but smaller than detectable. Equivalence tests (TOST) or Bayesian methods are needed to formally support the null.

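What is the Bonferroni correction and when should I use it?

When you run multiple hypothesis tests at α = 0.05, the probability of at least one false positive grows quickly: with 20 independent tests of true null hypotheses you would expect one false positive on average, and the chance of at least one is 1 − 0.95²⁰ ≈ 64%. The Bonferroni correction sets the per-test threshold to α/k (e.g., 0.05/5 = 0.01 for 5 tests) to control the family-wise error rate. Use it whenever several tests feed into one overall conclusion; it is conservative, especially when the tests are correlated.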
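
Applying the correction is mechanical, as the following minimal sketch suggests. The function name `bonferroni_reject` and the example p-values are made up for illustration.

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether H0 is rejected at the
    Bonferroni-adjusted threshold alpha / k, where k = number of tests."""
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Five hypothetical tests, so the per-test threshold is 0.05 / 5 = 0.01
    example = [0.004, 0.03, 0.2, 0.011, 0.049]
    print(bonferroni_reject(example))
```

Three of the five example p-values fall below 0.05, but only one survives the adjusted threshold of 0.01.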

How do I interpret statistical power?

Power (1−β) is the probability of correctly detecting a real effect. Power of 0.80 means an 80% chance of a significant result if the effect is real. Underpowered studies often fail to replicate. Use the Power Analysis tab to determine the required sample size for your expected effect size before collecting data.
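
For a rough sense of how power depends on effect size and sample size, the sketch below uses a normal approximation for a two-sided, two-sample test with equal group sizes. It is an assumption-laden illustration, not the calculator's method: the function name is hypothetical, and an exact calculation for a T-test would use the noncentral t-distribution instead.

```python
import math
from statistics import NormalDist


def power_two_sample_z(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample Z-test.

    d:           assumed true standardized effect size (Cohen's d).
    n_per_group: observations in each of the two groups.
    """
    std_normal = NormalDist()                       # mean 0, SD 1
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)  # e.g. about 1.96
    shift = d * math.sqrt(n_per_group / 2.0)        # expected z under H1
    # Probability the statistic lands in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (20, 40, 64, 100):
        print(n, round(power_two_sample_z(0.5, n), 3))
```

Under these assumptions a medium effect (d = 0.5) needs roughly 64 observations per group to reach the conventional 0.80 target, and with d = 0 the function returns α, as it should.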

What is Cohen's d and how do I interpret it?

Cohen's d is a standardized effect size calculated as d = |μ₁ − μ₂| / σ_pooled, producing a unitless number that's comparable across studies. Rules of thumb: d ≈ 0.2 is small, d ≈ 0.5 is medium, d ≈ 0.8 is large.
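
In practice the population means and pooled SD are estimated from two samples. A minimal sketch, assuming two independent groups and the usual pooled-variance estimate; the function name and the toy data are illustrative only.

```python
import math
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d from two samples: |mean difference| / pooled SD."""
    n1, n2 = len(group_a), len(group_b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(group_a)
                  + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    return abs(mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Toy data: means 4 and 3, both sample variances 4, so pooled SD is 2
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

For the toy data the mean difference is 1 and the pooled SD is 2, giving d = 0.5, a medium effect by the rule of thumb above.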

What is a critical value?

The critical value is the threshold your test statistic must exceed to reject H₀ at a given α. For a two-tailed z-test at α=0.05, the critical values are ±1.96. If |z| > 1.96, you reject H₀. Equivalently, p < α ⟺ |z| > z_critical.
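
Critical values for a Z-test can be recovered from the inverse normal CDF. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8+); the helper names are hypothetical, and T-test critical values would need the t-distribution instead.

```python
from statistics import NormalDist


def z_critical(alpha: float = 0.05, tail: str = "two") -> float:
    """Positive critical value of the standard normal for the given alpha.

    For a two-tailed test alpha is split across both tails; for a
    one-tailed test it sits entirely in one tail.
    """
    if tail == "two":
        quantile = 1.0 - alpha / 2.0
    elif tail == "one":
        quantile = 1.0 - alpha
    else:
        raise ValueError("tail must be 'one' or 'two'")
    return NormalDist().inv_cdf(quantile)


def rejects_two_tailed(z: float, alpha: float = 0.05) -> bool:
    """Two-tailed decision: reject H0 when |z| exceeds the critical value."""
    return abs(z) > z_critical(alpha, "two")


if __name__ == "__main__":
    print(round(z_critical(0.05, "two"), 3))  # two-tailed critical value
    print(round(z_critical(0.05, "one"), 3))  # one-tailed critical value
    print(rejects_two_tailed(2.1), rejects_two_tailed(1.5))
```

Comparing |z| with the critical value gives the same decision as comparing the two-tailed p-value with α.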

P-Value Calculator

Test Setup

Z-Score to P-Value Reference Table

T-Distribution Critical Values

P-Value Decision Guide

Multiple Comparisons — Bonferroni Correction

Power Calculator

Power vs Sample Size

How to Use This Calculator

Choose Your Test Type

Set Tail Direction & α

Read the Results

Formula & Methodology

Key Terms Explained

Real-World Examples

Does a new drug lower blood pressure?

Does version B of a landing page convert better?

Does a mindfulness intervention reduce anxiety? (T-test)

P-Values: Interpreting Statistical Significance

What a P-Value Is (and Is Not)

The Replication Crisis & P-Hacking

Effect Size Is As Important As P-Value

Frequently Asked Questions
