Hypothesis Testing #
Hypothesis testing is a structured way to decide:
Is what we see in a sample just random variation, or is there evidence of a real effect in the population?
Hypothesis testing sits inside inferential statistics: we use a sample to make a statement about a population. The surrounding toolkit covers:
- Sampling (random and stratified)
- Sampling distribution and Central Limit Theorem
- Estimation (confidence intervals and confidence level)
- Testing hypotheses (mean, proportion, ANOVA)
- Maximum likelihood (MLE)
Key takeaway: The logic is always the same:
- define a claim (null) and a competing claim (alternative)
- compute a test statistic from the sample
- use a threshold (critical value) or a p-value
- make a decision and write the conclusion in words
Big picture roadmap #
%%{init: {'theme':'base','themeVariables': {
'fontFamily':'Inter, ui-sans-serif, system-ui',
'primaryColor':'#E8F1FF',
'primaryTextColor':'#1F2937',
'primaryBorderColor':'#A7C7FF',
'lineColor':'#94A3B8',
'tertiaryColor':'#F8FAFC'
}}}%%
flowchart TD
A["Population<br/>unknown truth"] --> B["Sampling<br/>random / stratified"]
B --> C["Statistic from sample<br/>mean / proportion / variance"]
C --> D["Sampling distribution<br/>standard error"]
D --> E["Estimation<br/>confidence intervals"]
D --> F["Hypothesis test<br/>Z / t / proportion / F"]
E --> G["Interpretation<br/>confidence level, margin of error"]
F --> H["Decision<br/>reject H0 or fail to reject H0"]
H --> I["Report<br/>p-value + conclusion in words"]
G --> I
Key takeaway: Hypothesis testing depends on the idea of a sampling distribution. That is why we start with sampling and the Central Limit Theorem.
1) Sampling #
1.1 Population vs sample #
- Population: the full set of elements you care about (usually huge).
- Sample: a subset of the population that you can actually measure.
Why sample? Because measuring the full population is often too expensive in money, manpower, materials, and time.
1.2 Random sampling #
Random sampling means: each element has a known chance of selection, and selection is not biased.
Typical approach:
- assign IDs to the population list (sampling frame)
- use a random number generator to pick IDs
1.3 Stratified sampling #
Use stratified sampling when the population is heterogeneous.
Steps:
- split population into strata (groups)
- take a random sample within each stratum
- combine results using weights proportional to stratum sizes
Why it helps: it prevents under-representing important subgroups.
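The steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation: `stratified_sample`, the `region` field, and the proportional-allocation rule are invented for the example.

```python
import random

def stratified_sample(population, strata_key, total_n, seed=0):
    """Stratified random sampling: split the population into strata,
    then draw a random sample within each stratum, sized in
    proportion to the stratum's share of the population."""
    rng = random.Random(seed)
    # Step 1: group elements by stratum (e.g. region, age band).
    strata = {}
    for item in population:
        strata.setdefault(strata_key(item), []).append(item)
    sample = []
    for group in strata.values():
        # Step 2: proportional allocation, at least one per stratum.
        k = max(1, round(total_n * len(group) / len(population)))
        # Step 3: simple random sample within the stratum.
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample

# Hypothetical population: 80 people in the north, 20 in the south.
people = [{"id": i, "region": "north" if i < 80 else "south"}
          for i in range(100)]
sample = stratified_sample(people, lambda p: p["region"], total_n=10)
# Proportional allocation gives 8 from the north and 2 from the south.
```

Because allocation is proportional, the 80/20 split in the population is preserved in the 10-element sample, which is exactly the under-representation guarantee described above.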
Sampling error and sampling variation #
Sampling is random, so estimates vary from sample to sample.
- Sampling variation: the natural variability of sample statistics across different samples.
- Sampling error: the difference between a sample estimate and the true (unknown) population parameter.
A useful mental model:
Observed data = Truth + Bias + Random error

Sampling design tries to reduce bias and control random error.
2) Sampling distributions and the Central Limit Theorem #
2.1 Sampling distribution #
A statistic (like the sample mean) becomes a random variable if we repeat sampling. Its distribution over repeated samples is the sampling distribution.
This is the key reason we can quantify uncertainty.
2.2 Central Limit Theorem for the sample mean #
If you take a random sample of size \( n \) from a population with mean \( \mu \) and standard deviation \( \sigma \) :
- for sufficiently large \( n \) (rule of thumb: \( n\ge 30 \) ), the distribution of the sample mean \( \bar{x} \) is approximately normal
- if the population itself is normal, then \( \bar{x} \) is normal for any \( n \)
Mean and standard deviation of the sampling distribution:
\[ E(\bar{x})=\mu \] \[ \sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}} \]
The quantity \( \sigma_{\bar{x}} \) is the standard error of the mean.
2.3 Z-score for sample means #
\[ Z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \]
This lets you convert a sample mean into a standard normal value, so you can compute probabilities and thresholds.
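A quick simulation using only the standard library makes this concrete. The uniform population and the sample size \( n=36 \) are arbitrary choices for illustration:

```python
import random

# Population: uniform on [0, 10], so mu = 5 and sigma = 10 / sqrt(12).
mu, sigma, n = 5.0, 10 / 12 ** 0.5, 36
rng = random.Random(42)

# Repeat sampling many times and record each sample mean.
means = []
for _ in range(5000):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    means.append(sum(xs) / n)

# CLT: the sample means cluster around mu with spread sigma / sqrt(n).
se = sigma / n ** 0.5
z = (means[0] - mu) / se   # Z-score of one particular sample mean

# Roughly 95% of sample means should fall within 2 standard errors of mu.
p_within_2se = sum(abs(m - mu) < 2 * se for m in means) / len(means)
```

Even though the population is uniform (not normal at all), the distribution of the 5,000 sample means is approximately normal, which is the point of the CLT.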
2.4 CLT for the sample proportion #
For a binary variable (success/failure), let \( \hat{p}=x/n \) .
Approximate normality holds when: \( np \) and \( n(1-p) \) are both sufficiently large.
Mean and standard error:
\[ E(\hat{p})=p \] \[ \sigma_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}} \]
In practice, when \( p \) is unknown, we often plug in \( \hat{p} \).
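As a small numeric sketch of the plug-in standard error (the counts below are invented for illustration):

```python
# Observed: x successes in n trials (hypothetical numbers).
x, n = 54, 120
p_hat = x / n                              # sample proportion, 0.45
se = (p_hat * (1 - p_hat) / n) ** 0.5      # plug-in standard error

# Rule-of-thumb normality check: both counts are comfortably large here.
normal_ok = n * p_hat >= 10 and n * (1 - p_hat) >= 10
```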
3) Estimation (confidence intervals) #
3.1 Confidence level #
A confidence level is written as: \( 100(1-\alpha)\% \) .
Common values:
- 90% ( \( \alpha=0.10 \) )
- 95% ( \( \alpha=0.05 \) )
- 99% ( \( \alpha=0.01 \) )
3.2 Confidence interval for the mean (sigma known) #
Standard error:
\[ SE(\bar{x})=\frac{\sigma}{\sqrt{n}} \]
Confidence interval:
\[ \mu\in \bar{x}\pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]
Margin of error:
\[ E=Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]
Interpretation: the method produces intervals that capture \( \mu \) about \( 100(1-\alpha)\% \) of the time under repeated sampling.
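These formulas translate into a short helper using `statistics.NormalDist` for \( Z_{\alpha/2} \); `mean_ci` and the sample numbers are invented for the example:

```python
from statistics import NormalDist

def mean_ci(xbar, sigma, n, conf=0.95):
    """Confidence interval for mu with sigma known:
    xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, ~1.96 for 95%
    margin = z * sigma / n ** 0.5             # margin of error E
    return xbar - margin, xbar + margin

# Hypothetical: xbar = 50, known sigma = 6, n = 36, 95% confidence.
lo, hi = mean_ci(xbar=50.0, sigma=6.0, n=36, conf=0.95)
# Interval is roughly (48.04, 51.96).
```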
3.3 Confidence interval for a proportion #
Standard error (using \( \hat{p} \) ):
\[ SE(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Confidence interval:
\[ p\in \hat{p}\pm Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
3.4 Difference of two means (large-sample form) #
\[ (\mu_1-\mu_2)\in (\bar{x}_1-\bar{x}_2)\pm Z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \]
3.5 Difference of two proportions #
\[ SE(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
\[ (p_1-p_2)\in (\hat{p}_1-\hat{p}_2)\pm Z_{\alpha/2}\,SE(\hat{p}_1-\hat{p}_2) \]
4) Testing hypotheses #
4.1 Hypotheses #
- Null hypothesis \( H_0 \) : baseline claim (no change / no effect)
- Alternative hypothesis \( H_1 \) : competing claim (change / effect)
A hypothesis test outputs:
- a test statistic ( \( Z \) , \( t \) , \( \chi^2 \) , \( F \) )
- a p-value (or a critical region decision)
- a final conclusion
4.2 Errors in testing #
- Type I error: reject \( H_0 \) when \( H_0 \) is true; probability = \( \alpha \)
- Type II error: fail to reject \( H_0 \) when \( H_1 \) is true; probability = \( \beta \)
- Power of a test: \( 1-\beta \), the probability of correctly rejecting \( H_0 \) when \( H_1 \) is true
4.3 Test workflow #
%%{init: {'theme':'base','themeVariables': {
'fontFamily':'Inter, ui-sans-serif, system-ui',
'primaryColor':'#FDEBFF',
'primaryTextColor':'#1F2937',
'primaryBorderColor':'#F5B7FF',
'lineColor':'#94A3B8',
'tertiaryColor':'#F8FAFC'
}}}%%
flowchart TD
A["Define hypotheses<br/>H0 and H1"] --> B["Choose significance level<br/>alpha = 0.10 or 0.05 or 0.01"]
B --> C["Select test statistic<br/>Z or t or chi-square or F"]
C --> D["Compute statistic<br/>from sample data"]
D --> E["Compute p-value<br/>or compare with critical value"]
E --> F["Decision"]
F --> G["If p <= alpha<br/>Reject H0"]
F --> H["If p > alpha<br/>Fail to reject H0"]
G --> I["Conclusion in words"]
H --> I
5) Mean-based tests (one mean) #
Use this when testing a claim about \( \mu \) .
A common Z-form (when \( \sigma \) is known):
\[ Z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \]
where \( \mu_0 \) is the mean stated in \( H_0 \). When \( \sigma \) is unknown, replace it with the sample standard deviation \( s \) and use the \( t \) distribution with \( n-1 \) degrees of freedom.
Decision approaches:
- critical value method: compare \( |Z| \) to \( Z_{\alpha/2} \) (two-tailed) or \( Z \) to \( \pm Z_{\alpha} \) (one-tailed)
- p-value method: compute the p-value from the standard normal distribution and compare it with \( \alpha \)
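The whole test fits in a few lines; `z_test_mean` and the sample numbers are hypothetical:

```python
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, two_tailed=True):
    """One-sample Z test for H0: mu = mu0, with sigma known.
    Returns the test statistic and its p-value."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    tail = 1 - NormalDist().cdf(abs(z))          # P(Z > |z|)
    return z, 2 * tail if two_tailed else tail

# Hypothetical: H0 claims mu0 = 100; a sample of n = 49 gave
# xbar = 103 with known sigma = 7.
z, p = z_test_mean(xbar=103, mu0=100, sigma=7, n=49)
# z = 3 / (7 / 7) = 3.0 and p is about 0.0027, so at alpha = 0.05
# we reject H0.
```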
6) Proportion-based tests (one proportion) #
Use when outcomes are binary and you test a claim about \( p \) .
\[ Z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \]
where \( p_0 \) is the proportion stated in \( H_0 \). Note that the standard error uses \( p_0 \), not \( \hat{p} \), because the statistic is computed under the assumption that \( H_0 \) is true.
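The same pattern works for a proportion; `z_test_prop` and the counts are illustrative:

```python
from statistics import NormalDist

def z_test_prop(x, n, p0, two_tailed=True):
    """One-proportion Z test for H0: p = p0.
    The standard error uses p0, since it is evaluated under H0."""
    p_hat = x / n
    se = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / se
    tail = 1 - NormalDist().cdf(abs(z))
    return z, 2 * tail if two_tailed else tail

# Hypothetical: H0: p = 0.5, observed 60 successes in 100 trials.
z, p = z_test_prop(x=60, n=100, p0=0.5)
# z = (0.6 - 0.5) / 0.05 = 2.0 and p is about 0.0455.
```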
7) ANOVA (single and dual factor) #
ANOVA is used when comparing more than two means.
Instead of running many pairwise tests (which inflates Type I error), ANOVA uses an \( F \) statistic.
- Single-factor ANOVA: one categorical factor (one grouping variable)
- Two-factor ANOVA: two factors (and possibly an interaction effect)
Core idea: compare variation between groups to variation within groups. If between-group variation is much larger, at least one mean is different.
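The between-versus-within comparison can be computed by hand for the single-factor case. This sketch builds only the \( F \) statistic (in practice you would then compare it with an \( F \) critical value or compute a p-value); the function name and data are invented:

```python
def one_way_anova_f(groups):
    """Single-factor ANOVA F statistic: mean square between groups
    divided by mean square within groups."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    # Between-group sum of squares (df = k - 1)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares (df = n - k)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical data: test scores under three teaching methods.
f = one_way_anova_f([[80, 82, 84], [75, 77, 79], [90, 92, 94]])
# The group means (82, 77, 92) differ a lot relative to the spread
# within each group, so F comes out large (43.75 here).
```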
8) Maximum likelihood (MLE) #
Maximum likelihood estimates model parameters by choosing values that make the observed data most probable.
Let the data be \( x_1,\dots,x_n \) and parameter \( \theta \) .
Likelihood:
\[ L(\theta)=\prod_{i=1}^{n} f(x_i\mid \theta) \]
Log-likelihood:
\[ \ell(\theta)=\log L(\theta)=\sum_{i=1}^{n} \log f(x_i\mid \theta) \]
MLE:
\[ \hat{\theta}_{\text{MLE}}=\arg\max_{\theta}\; L(\theta) =\arg\max_{\theta}\; \ell(\theta) \]
Quick examples:
- Bernoulli trials: \( \hat{p}=x/n \)
- Normal with known \( \sigma \) : \( \hat{\mu}=\bar{x} \)
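The Bernoulli example can be checked numerically: maximize the log-likelihood over a grid of candidate values for \( p \) and confirm the maximizer is \( x/n \). The grid-search approach is just an illustration; in this case the maximum has the closed form above.

```python
import math

def bernoulli_loglik(p, x, n):
    """Log-likelihood of observing x successes in n Bernoulli trials."""
    return x * math.log(p) + (n - x) * math.log(1 - p)

# Hypothetical data: 7 successes in 10 trials, so the MLE should be 0.7.
x, n = 7, 10

# Grid search over p in (0, 1); the log-likelihood is concave,
# so the best grid point sits at the analytic maximizer x / n.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_loglik(p, x, n))
```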
Mini-check (self-test) #
- What does “95% confidence” mean in repeated sampling?
- What is the standard error of the mean?
- In hypothesis testing, what does \( \alpha \) represent?
- What is the difference between “reject \( H_0 \) ” and “fail to reject \( H_0 \) ”?
- Why is ANOVA preferred over multiple pairwise mean tests?
Answers:
- The method captures the true parameter about 95% of the time over repeated samples.
- \( \sigma/\sqrt{n} \) (or with finite population correction if needed).
- Probability of Type I error.
- Fail to reject means evidence is not strong enough; it does not mean \( H_0 \) is proven true.
- It controls Type I error inflation when comparing several means.