Linear Regression #
Linear Regression is a supervised ML method used to predict a numerical target by fitting a model that is linear in its parameters.
In ML, linear models are a core baseline: they’re fast, often surprisingly strong, and usually easy to interpret.
Key takeaway: Linear Regression learns parameters by minimising a squared-error cost. You can solve it directly (closed form) or iteratively (gradient descent), and you can extend it using basis functions and regularisation.
```mermaid
flowchart TD
    T["Linear<br/>regression<br/>models"] --> SL["Simple<br/>linear"]
    T --> ML["Multiple<br/>linear"]
    T --> PR["Polynomial<br/>(linear in params)"]
    T --> R["Ridge<br/>(L2)"]
    T --> L["Lasso<br/>(L1)"]
    T --> EN["Elastic<br/>Net"]
    SL -->|1 feature| X1["One<br/>predictor"]
    ML -->|many features| XM["Multiple<br/>predictors"]
    PR -->|feature mapping| PHI["Basis<br/>functions"]
    R -->|shrinks| W2["Weights"]
    L -->|selects| SP["Sparse<br/>weights"]
    EN -->|mixes| MIX["L1 + L2"]
    style T fill:#90CAF9,stroke:#1E88E5,color:#000
    style SL fill:#C8E6C9,stroke:#2E7D32,color:#000
    style ML fill:#C8E6C9,stroke:#2E7D32,color:#000
    style PR fill:#C8E6C9,stroke:#2E7D32,color:#000
    style R fill:#C8E6C9,stroke:#2E7D32,color:#000
    style L fill:#C8E6C9,stroke:#2E7D32,color:#000
    style EN fill:#C8E6C9,stroke:#2E7D32,color:#000
    style X1 fill:#CE93D8,stroke:#8E24AA,color:#000
    style XM fill:#CE93D8,stroke:#8E24AA,color:#000
    style PHI fill:#CE93D8,stroke:#8E24AA,color:#000
    style W2 fill:#CE93D8,stroke:#8E24AA,color:#000
    style SP fill:#CE93D8,stroke:#8E24AA,color:#000
    style MIX fill:#CE93D8,stroke:#8E24AA,color:#000
```
Why Linear Regression #
Common reasons to use linear regression:
- It can be built relatively easily and serves as a strong baseline.
- It is often more interpretable than “black-box” models.
- In practice, business/client constraints may require interpretability, even if that trades off some accuracy.
Simple vs Multiple Linear Regression #
Simple linear regression (one predictor variable):
\[ y = \beta_0 + \beta_1 x \]

Multiple linear regression (many predictors):
\[ y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \]

Matrix form (useful for implementation in ML):
- Design matrix: $X \in \mathbb{R}^{n \times (d+1)}$ (with a column of 1s for the intercept)
- Parameters: $w \in \mathbb{R}^{d+1}$
- Targets: $y \in \mathbb{R}^{n}$
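As a minimal sketch of this matrix form (with hypothetical toy data), the intercept is handled by prepending a column of 1s to the raw features, after which prediction is a single matrix-vector product:

```python
import numpy as np

# Hypothetical toy data: n = 4 samples, d = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
n, d = X_raw.shape

# Design matrix: a column of 1s so that w[0] acts as the intercept beta_0.
X = np.hstack([np.ones((n, 1)), X_raw])   # shape (n, d + 1)

# With parameters w, predictions are just X @ w.
w = np.array([0.5, 1.0, -2.0])            # [beta_0, beta_1, beta_2]
y_hat = X @ w                             # shape (n,)
```

The same design matrix `X` is reused by both the closed-form and gradient-descent solutions later in these notes.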
Worked Example: Age vs Distance Visible #
A sample of drivers was collected to study the question: how strong is the linear relationship between a driver's age and the distance they can see?
- Predictor (x-axis): Age (years)
- Response (y-axis): Distance visible (ft)
A fitted “line of best fit” from the lecture is:
\[ \text{dist} = -3.0068(\text{Age}) + 576.6819 \]

Interpretation: for every 1-year increase in age, the visibility distance decreases by approximately 3 ft.
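Plugging example ages into the fitted line makes the slope concrete (the ages here are hypothetical, not from the lecture data):

```python
# Prediction using the fitted line from the lecture:
#   dist = -3.0068 * Age + 576.6819
def predicted_distance(age):
    return -3.0068 * age + 576.6819

# Comparing two hypothetical drivers, 40 years apart:
d20 = predicted_distance(20)   # about 516.5 ft
d60 = predicted_distance(60)   # about 396.3 ft
# The 40-year gap costs roughly 40 * 3.0068 ≈ 120 ft of visibility.
```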
Correlation #
Correlation measures linear dependence only. A low correlation does not mean “no relationship”: it may mean the relationship is poorly captured by a linear model.
Also: Being able to fit a line does not necessarily mean the model is good.
Covariance vs correlation #
Covariance:
\[ \mathrm{cov}(x,y)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \]

- sign tells the direction of the relationship
- units depend on the scale of $x$ and $y$ (harder to compare across datasets)
Correlation:
\[ r=\frac{\mathrm{cov}(x,y)}{\sigma_x\sigma_y} \]

- dimensionless and bounded in $[-1,1]$
- easier to interpret and compare
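The two formulas above can be computed directly (a small sketch with made-up near-linear data):

```python
import numpy as np

# Hypothetical data: y is roughly 2x, so r should be close to +1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, 8.2, 9.9])

# Population covariance (the 1/n form used above).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation: covariance divided by the product of standard deviations.
r = cov_xy / (x.std() * y.std())
```

Note that `cov_xy` carries the units of `x` times the units of `y`, while `r` is unitless, which is exactly why correlation is easier to compare across datasets.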
R-squared (R²) #
R² measures how much variance the model explains (compared to predicting the mean):
\[ R^2=1-\frac{SSE}{SST} \]

Where:
- $SSE=\sum (y_i-\hat{y}_i)^2$
- $SST=\sum (y_i-\bar{y})^2$
Interpretation:
- higher R² usually means better fit
- very high R² on training data can indicate overfitting (check validation/test)
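A minimal implementation of the $R^2$ definition above:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST: how much better the model is than predicting the mean."""
    sse = np.sum((y - y_hat) ** 2)          # residual sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - sse / sst

y = np.array([3.0, 5.0, 7.0, 9.0])
perfect = r_squared(y, y)                        # 1.0: all variance explained
baseline = r_squared(y, np.full_like(y, y.mean()))  # 0.0: no better than the mean
```

The two boundary cases make the interpretation concrete: predicting exactly gives $R^2 = 1$, and predicting the mean gives $R^2 = 0$.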
Underfitting vs overfitting (quick diagnosis) #
- high training error and high test error: underfitting (high bias)
- low training error but high test error: overfitting (high variance)
Interpreting a log-transform (quick note) #
If you model $\log(y)$ as linear:
\[ \log(y)=\beta_0+\beta_1 x \]

Then a 1-unit increase in $x$ multiplies $y$ by $e^{\beta_1}$ (approximately a percentage-change effect).
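A quick numerical check of the multiplicative effect, using hypothetical coefficients:

```python
import math

# Hypothetical fitted log-model: log(y) = 1.0 + 0.05 * x
beta_0, beta_1 = 1.0, 0.05

def predict_y(x):
    return math.exp(beta_0 + beta_1 * x)

# A 1-unit increase in x multiplies y by e^{beta_1}:
ratio = predict_y(11) / predict_y(10)
# ratio = e^{0.05} ≈ 1.0513, i.e. roughly a 5% increase per unit of x
```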
Direct Solution Method #
In regression we pick parameters to minimise prediction error. A common choice is the least squares method.
- a mathematical optimisation method for finding the best-fitting line or curve through a set of data points
- it works by minimising the sum of squared residuals (differences between observed and predicted values)
- commonly used in regression analysis to determine the line of best fit
For the full OLS derivation and closed-form solutions:
- /docs/ai/machine-learning/03-ordinary-least-squares/
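The full derivation lives at the link above; as a minimal sketch (with synthetic data generated from a known line), the closed-form solution $w = (X^\top X)^{-1} X^\top y$ can be computed with `np.linalg.lstsq`, which solves the same least-squares problem in a numerically stable way:

```python
import numpy as np

# Synthetic data from a known line y = 2x + 1, plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Design matrix with an intercept column, then solve min_w ||Xw - y||^2.
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
# w should recover roughly [1.0, 2.0] (intercept, slope)
```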
Iterative Method #
Gradient Descent (batch/stochastic/mini-batch) #
When a direct solution is expensive (or you prefer iterative optimisation), you can minimise the cost using gradient descent.
Full gradient descent notes (types, gradients, updates):
- /docs/ai/machine-learning/03-gradient-descent-linear-regression/
Cost function definition:
- /docs/ai/machine-learning/03-cost-function/
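The linked notes cover the details; as a hedged sketch of batch gradient descent on mean squared error (toy data and learning rate chosen for illustration):

```python
import numpy as np

# Toy problem: recover y = 3x + 0.5 from noisy samples.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = 3.0 * x + 0.5 + rng.normal(scale=0.05, size=x.size)

X = np.column_stack([np.ones_like(x), x])   # intercept + feature
w = np.zeros(2)                             # start from zero parameters
lr = 0.5                                    # learning rate (illustrative choice)

for _ in range(2000):
    # Gradient of the mean-squared-error cost (2/n) X^T (Xw - y).
    grad = (2 / len(y)) * X.T @ (X @ w - y)
    w -= lr * grad
# w should converge to roughly [0.5, 3.0]
```

In practice the learning rate and iteration count need tuning; here the feature lies in $[0,1]$, which is also why the note in the checklist about feature scaling matters for gradient descent.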
Linear basis function models #
A key idea in ML is: you can make linear regression more powerful by transforming inputs.
Instead of using $x$ directly, we use a feature mapping $\phi(x)$:
\[ \phi(x) = [\phi_0(x), \phi_1(x), \dots, \phi_M(x)] \]

Then the model becomes:
\[ y \approx \sum_{j=0}^{M} w_j \phi_j(x) \]

Important: this can be non-linear in $x$, but it is still linear in the parameters $w$.
Common basis functions #
Polynomial basis: $\phi(x) = [1, x, x^2, \dots, x^M]$
Radial basis function (Gaussian / RBF): centres $\mu_j$, width $\sigma$
\[ \phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2\sigma^2}\right) \]

Piecewise / spline-like basis: useful when the relationship changes behaviour across different input ranges.
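The polynomial and Gaussian basis functions above can be sketched as feature-mapping functions (centres and widths here are illustrative choices):

```python
import numpy as np

def polynomial_features(x, degree):
    """phi(x) = [1, x, x^2, ..., x^degree] for each scalar input."""
    return np.vstack([x ** j for j in range(degree + 1)]).T

def rbf_features(x, centres, sigma):
    """Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * sigma ** 2))

x = np.linspace(-1, 1, 5)
Phi_poly = polynomial_features(x, degree=3)                        # shape (5, 4)
Phi_rbf = rbf_features(x, np.array([-1.0, 0.0, 1.0]), sigma=0.5)   # shape (5, 3)
```

Either `Phi_poly` or `Phi_rbf` can then be used as the design matrix in ordinary least squares, keeping the model linear in $w$ while making it non-linear in $x$.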
Why this matters #
More basis functions:
- usually reduces bias (more flexibility)
- can increase variance (risk of overfitting)
This leads directly to: bias–variance decomposition.
Bias-variance decomposition #
For squared error, expected prediction error at input $x$ can be decomposed into:
\[ \mathbb{E}\left[(y-\hat{f}(x))^2\right] = \underbrace{\left(\text{Bias}[\hat{f}(x)]\right)^2}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to training data}} + \underbrace{\sigma^2}_{\text{irreducible noise}} \]

Meaning:
- Bias: error from overly simple assumptions (underfitting)
- Variance: error from sensitivity to the particular training sample (overfitting)
- Noise: error you cannot remove (measurement noise, missing variables)
Underfitting vs overfitting #
Underfitting: high bias, low variance
Overfitting: low bias, high variance
Typical pattern: as model complexity increases (e.g., higher-degree polynomial basis), bias tends to go down and variance tends to go up.
Practical diagnosis #
Compare: training error vs test error
- high training error and high test error: likely high bias (model too simple)
- low training error but high test error: likely high variance (model too complex)
Checklist #
When you build a linear regression model, always check:
- Is a linear model shape sensible for this relationship?
- Do residuals show patterns (suggesting non-linearity)?
- Are there outliers strongly influencing the fit?
- Does the model generalise (train vs test performance)?
- Are features scaled (especially for gradient descent)?
- If you use basis functions: are you controlling overfitting (validation, regularisation, early stopping)?
References #
- STAT 501 (Penn State) lesson used in the example
- Tom Mitchell: Machine Learning (textbook reference for core concepts)