Regression (Linear Models)

Linear Regression #

Linear Regression is a supervised ML method used to predict a numerical target by fitting a model that is linear in its parameters.

In ML, linear models are a core baseline: they’re fast, often surprisingly strong, and usually easy to interpret.

Key takeaway: Linear Regression learns parameters by minimising a squared-error cost. You can solve it directly (closed form) or iteratively (gradient descent), and you can extend it using basis functions and regularisation.

flowchart TD
T["Linear<br/>regression<br/>models"] --> SL["Simple<br/>linear"]
T --> ML["Multiple<br/>linear"]
T --> PR["Polynomial<br/>(linear in params)"]
T --> R["Ridge<br/>(L2)"]
T --> L["Lasso<br/>(L1)"]
T --> EN["Elastic<br/>Net"]

SL -->|1 feature| X1["One<br/>predictor"]
ML -->|many features| XM["Multiple<br/>predictors"]
PR -->|feature mapping| PHI["Basis<br/>functions"]

R -->|shrinks| W2["Weights"]
L -->|selects| SP["Sparse<br/>weights"]
EN -->|mixes| MIX["L1 + L2"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style SL fill:#C8E6C9,stroke:#2E7D32,color:#000
style ML fill:#C8E6C9,stroke:#2E7D32,color:#000
style PR fill:#C8E6C9,stroke:#2E7D32,color:#000
style R fill:#C8E6C9,stroke:#2E7D32,color:#000
style L fill:#C8E6C9,stroke:#2E7D32,color:#000
style EN fill:#C8E6C9,stroke:#2E7D32,color:#000

style X1 fill:#CE93D8,stroke:#8E24AA,color:#000
style XM fill:#CE93D8,stroke:#8E24AA,color:#000
style PHI fill:#CE93D8,stroke:#8E24AA,color:#000
style W2 fill:#CE93D8,stroke:#8E24AA,color:#000
style SP fill:#CE93D8,stroke:#8E24AA,color:#000
style MIX fill:#CE93D8,stroke:#8E24AA,color:#000

Why Linear Regression #

Common reasons to use linear regression:

  • It can be built relatively easily and serves as a strong baseline.
  • It is often more interpretable than “black-box” models.
  • In practice, business/client constraints may require interpretability, even if that trades off some accuracy.

Simple vs Multiple Linear Regression #

Simple linear regression (one predictor variable):

\[ y = \beta_0 + \beta_1 x \]

Multiple linear regression (many predictors):

\[ y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \]

Matrix form (useful for implementation in ML):

  • Design matrix: $X \in \mathbb{R}^{n \times (d+1)}$ (with a column of 1s for the intercept)
  • Parameters: $w \in \mathbb{R}^{d+1}$
  • Targets: $y \in \mathbb{R}^{n}$
\[ \hat{y} = Xw \]
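As a minimal sketch of the matrix form (the numbers below are illustrative, not from the notes):

```python
import numpy as np

# Toy data: n = 3 samples, d = 2 features (values are illustrative)
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# Design matrix: prepend a column of 1s for the intercept -> shape (n, d+1)
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Parameters w = [intercept, w1, w2]
w = np.array([0.5, 1.0, -1.0])

# Predictions: y_hat = X w
y_hat = X @ w
print(y_hat)  # [-0.5 -0.5 -0.5]
```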

Worked Example: Age vs Distance Visible #

A sample of drivers was collected to answer the question: how strong is the linear relationship between a driver's age and the distance they can see?

  • Predictor (x-axis): Age (years)
  • Response (y-axis): Distance visible (ft)

A fitted “line of best fit” from the lecture is:

\[ \text{dist} = -3.0068(\text{Age}) + 576.6819 \]

Interpretation: for every additional year of age, the predicted visibility distance decreases by approximately 3 ft.
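The fitted line can be used directly for prediction; a quick sketch (the function name is my own, only the coefficients come from the lecture):

```python
def predict_distance(age):
    """Predicted visibility distance (ft) for a driver of the given age,
    using the fitted line dist = -3.0068 * Age + 576.6819."""
    return -3.0068 * age + 576.6819

print(predict_distance(30))  # ~486.5 ft
print(predict_distance(60))  # ~396.3 ft
```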


Correlation #

Correlation measures linear dependence only. A low correlation does not mean “no relationship”: it may simply mean the data is poorly explained by a linear model (a strong non-linear pattern can have near-zero correlation).

Also: Being able to fit a line does not necessarily mean the model is good.


Covariance vs correlation #

Covariance:

\[ \mathrm{cov}(x,y)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \]
  • sign tells direction of relationship
  • units depend on the scale of $x$ and $y$ (harder to compare across datasets)

Correlation:

\[ r=\frac{\mathrm{cov}(x,y)}{\sigma_x\sigma_y} \]
  • dimensionless and bounded in $[-1,1]$
  • easier to interpret and compare
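The two quantities above can be computed directly; a small sketch with illustrative data, using the same $\frac{1}{n}$ convention as the covariance formula:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])  # roughly y = 2x (illustrative)

# Covariance with the 1/n convention used above
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation: covariance normalised by both standard deviations
r = cov_xy / (x.std() * y.std())
print(cov_xy, r)  # cov depends on the units; r is close to 1
```

Note that `np.corrcoef(x, y)[0, 1]` gives the same `r`: the $\frac{1}{n}$ vs $\frac{1}{n-1}$ convention cancels in the ratio.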

R-squared (R²) #

R² measures how much variance the model explains (compared to predicting the mean):

\[ R^2=1-\frac{SSE}{SST} \]

Where:

  • $SSE=\sum (y_i-\hat{y}_i)^2$
  • $SST=\sum (y_i-\bar{y})^2$

Interpretation:

  • higher R² usually means better fit
  • very high R² on training data can indicate overfitting (check validation/test)
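The R² formula above is straightforward to compute by hand; a minimal sketch with made-up observed and predicted values:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # observed (illustrative)
y_hat = np.array([2.8, 5.3, 6.9, 9.1])   # model predictions (illustrative)

sse = np.sum((y - y_hat) ** 2)           # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)        # total variation around the mean
r2 = 1 - sse / sst
print(r2)  # 0.9925
```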

Underfitting vs overfitting (quick diagnosis) #

  • high training error and high test error: underfitting (high bias)
  • low training error but high test error: overfitting (high variance)

Interpreting a log-transform (quick note) #

If you model $\log(y)$ as linear:

\[ \log(y)=\beta_0+\beta_1 x \]

Then a 1-unit increase in $x$ multiplies $y$ by $e^{\beta_1}$ (approximately a percentage change effect).
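A quick numeric check of this multiplicative effect (the coefficients are illustrative):

```python
import numpy as np

beta0, beta1 = 1.0, 0.05  # illustrative coefficients

def y_of(x):
    # log(y) = beta0 + beta1 * x  =>  y = exp(beta0 + beta1 * x)
    return np.exp(beta0 + beta1 * x)

# A 1-unit increase in x multiplies y by exp(beta1)
ratio = y_of(11) / y_of(10)
print(ratio, np.exp(beta1))  # both ~1.0513, i.e. about a 5% increase
```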


Direct Solution Method #

In regression we pick parameters to minimise prediction error. A common choice is the least squares method:

  • a mathematical optimisation method that finds the best-fitting line or curve through a set of data points
  • it works by minimising the sum of squared residuals (the differences between observed and predicted values)
  • it is the standard criterion in regression analysis for determining the line of best fit

For the full OLS derivation and closed-form solutions:

  • /docs/ai/machine-learning/03-ordinary-least-squares/
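As a minimal sketch of the direct solution on synthetic data (the derivation lives at the linked page; `np.linalg.lstsq` solves the least-squares problem and is preferred over explicitly inverting $X^\top X$ for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise (coefficients are illustrative)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)

# Design matrix with an intercept column of 1s
X = np.column_stack([np.ones_like(x), x])

# Closed-form least-squares solution
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # close to [2, 3]
```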

Iterative Method #

Gradient Descent (batch/stochastic/mini-batch) #

When a direct solution is expensive (or you prefer iterative optimisation), you can minimise the cost using gradient descent.

Full gradient descent notes (types, gradients, updates):

  • /docs/ai/machine-learning/03-gradient-descent-linear-regression/

Cost function definition:

  • /docs/ai/machine-learning/03-cost-function/
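A minimal batch gradient descent sketch on synthetic data, assuming a mean-squared-error cost (the full notes are at the links above; learning rate and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=100)  # true w = [1, 2]

X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
w = np.zeros(2)
lr = 0.1      # learning rate
n = len(y)

for _ in range(1000):
    grad = (2 / n) * X.T @ (X @ w - y)  # gradient of mean squared error
    w -= lr * grad                      # update step

print(w)  # approaches [1, 2]
```

Note that features here are already on a comparable scale; with unscaled features a single learning rate converges much more slowly, which is why the checklist below recommends feature scaling for gradient descent.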

Linear basis function models #

A key idea in ML: you can make linear regression more powerful by transforming the inputs.

Instead of using $x$ directly, we use a feature mapping $\phi(x)$:

\[ \phi(x) = [\phi_0(x), \phi_1(x), \dots, \phi_M(x)] \]

Then the model becomes:

\[ y \approx \sum_{j=0}^{M} w_j \phi_j(x) \]

Important: This can be non-linear in $x$, but it is still linear in the parameters $w$.

Common basis functions #

Polynomial basis: $\phi(x) = [1, x, x^2, \dots, x^M]$

Radial basis function (Gaussian / RBF): centres $\mu_j$, width $\sigma$

\[ \phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2\sigma^2}\right) \]

Piecewise / spline-like basis: useful when the relationship changes behaviour across different input ranges.
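The polynomial and RBF mappings above can be sketched as feature-map functions (centres, width, and degree below are illustrative):

```python
import numpy as np

def polynomial_basis(x, degree):
    """phi(x) = [1, x, x^2, ..., x^degree] for each sample."""
    return np.column_stack([x ** j for j in range(degree + 1)])

def rbf_basis(x, centres, sigma):
    """Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2))."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * sigma ** 2))

x = np.linspace(0, 1, 5)
print(polynomial_basis(x, 3).shape)                      # (5, 4)
print(rbf_basis(x, np.array([0.25, 0.75]), 0.2).shape)   # (5, 2)
```

Either mapping produces a new design matrix $\Phi$, and fitting $w$ to $\Phi$ is still an ordinary linear least-squares problem.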

Why this matters #

More basis functions:

  • usually reduces bias (more flexibility)
  • can increase variance (risk of overfitting)

This flexibility trade-off leads directly to the bias–variance decomposition.


Bias-variance decomposition #

For squared error, expected prediction error at input $x$ can be decomposed into:

\[ \mathbb{E}\left[(y-\hat{f}(x))^2\right] = \underbrace{\left(\text{Bias}[\hat{f}(x)]\right)^2}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to training data}} + \underbrace{\sigma^2}_{\text{irreducible noise}} \]

Meaning:

  • Bias: error from overly simple assumptions (underfitting)
  • Variance: error from sensitivity to the particular training sample (overfitting)
  • Noise: error you cannot remove (measurement noise, missing variables)
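The decomposition can be estimated empirically by refitting a model on many resampled training sets and looking at the spread of its predictions at one input. A minimal Monte Carlo sketch (the target function, noise level, and degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)

def fit_predict(degree, x_test, n_train=20, noise=0.3):
    """Fit a polynomial of the given degree to one random training set
    and return its prediction at x_test."""
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, noise, n_train)
    return np.polyval(np.polyfit(x, y, degree), x_test)

x0 = 0.25  # evaluate the decomposition at a single input
results = {}
for degree in (1, 9):
    preds = np.array([fit_predict(degree, x0) for _ in range(500)])
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # squared bias estimate
    var = preds.var()                           # variance estimate
    results[degree] = (bias_sq, var)
    print(degree, bias_sq, var)
```

Typically the degree-1 model shows high squared bias (it cannot follow the sine curve) while the degree-9 model shows low bias but higher variance, matching the pattern described below.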

Underfitting vs overfitting #

Underfitting: high bias, low variance

Overfitting: low bias, high variance

Typical pattern: as model complexity increases (e.g., higher-degree polynomial basis), bias tends to go down and variance tends to go up.

Practical diagnosis #

Compare: training error vs test error

  • high training error and high test error: likely high bias (model too simple)
  • low training error but high test error: likely high variance (model too complex)

Checklist #

When you build a linear regression model, always check:

  • Is a linear model shape sensible for this relationship?
  • Do residuals show patterns (suggesting non-linearity)?
  • Are there outliers strongly influencing the fit?
  • Does the model generalise (train vs test performance)?
  • Are features scaled (especially for gradient descent)?
  • If you use basis functions: are you controlling overfitting (validation, regularisation, early stopping)?
