Linear Regression #
Linear Regression is a supervised ML method used to predict a numerical target by fitting a model that is linear in its parameters.
In ML, linear models are a core baseline: they’re fast, often surprisingly strong, and usually easy to interpret.
Key takeaway: Linear Regression learns parameters by minimising a squared-error cost. You can solve it directly (closed form) or iteratively (gradient descent), and you can extend it using basis functions and regularisation.
```mermaid
flowchart TD
    T["Linear<br/>regression<br/>models"] --> SL["Simple<br/>linear"]
    T --> ML["Multiple<br/>linear"]
    T --> PR["Polynomial<br/>(linear in params)"]
    T --> R["Ridge<br/>(L2)"]
    T --> L["Lasso<br/>(L1)"]
    T --> EN["Elastic<br/>Net"]
    SL -->|1 feature| X1["One<br/>predictor"]
    ML -->|many features| XM["Multiple<br/>predictors"]
    PR -->|feature mapping| PHI["Basis<br/>functions"]
    R -->|shrinks| W2["Weights"]
    L -->|selects| SP["Sparse<br/>weights"]
    EN -->|mixes| MIX["L1 + L2"]
    style T fill:#90CAF9,stroke:#1E88E5,color:#000
    style SL fill:#C8E6C9,stroke:#2E7D32,color:#000
    style ML fill:#C8E6C9,stroke:#2E7D32,color:#000
    style PR fill:#C8E6C9,stroke:#2E7D32,color:#000
    style R fill:#C8E6C9,stroke:#2E7D32,color:#000
    style L fill:#C8E6C9,stroke:#2E7D32,color:#000
    style EN fill:#C8E6C9,stroke:#2E7D32,color:#000
    style X1 fill:#CE93D8,stroke:#8E24AA,color:#000
    style XM fill:#CE93D8,stroke:#8E24AA,color:#000
    style PHI fill:#CE93D8,stroke:#8E24AA,color:#000
    style W2 fill:#CE93D8,stroke:#8E24AA,color:#000
    style SP fill:#CE93D8,stroke:#8E24AA,color:#000
    style MIX fill:#CE93D8,stroke:#8E24AA,color:#000
```
Why Linear Regression #
Common reasons to use linear regression:
- It can be built relatively easily and serves as a strong baseline.
- It is often more interpretable than “black-box” models.
- In practice, business/client constraints may require interpretability, even if that trades off some accuracy.
Simple vs Multiple Linear Regression #
Simple linear regression (one predictor variable):
\[ y = \beta_0 + \beta_1 x \]

Multiple linear regression (many predictors):
\[ y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \]

Matrix form (useful for implementation in ML):
- Design matrix: $X \in \mathbb{R}^{n \times (d+1)}$ (with a column of 1s for the intercept)
- Parameters: $w \in \mathbb{R}^{d+1}$
- Targets: $y \in \mathbb{R}^{n}$
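As a minimal sketch of this matrix form (with hypothetical toy data), the intercept is handled by prepending a column of 1s to the raw features, after which prediction is a single matrix-vector product:

```python
import numpy as np

# Hypothetical toy data: n = 4 samples, d = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
n, d = X_raw.shape

# Design matrix: a column of 1s so that w[0] acts as the intercept beta_0.
X = np.hstack([np.ones((n, 1)), X_raw])   # shape (n, d + 1)

# With parameters w, predictions are just X @ w.
w = np.array([0.5, 1.0, -2.0])            # [beta_0, beta_1, beta_2]
y_hat = X @ w                             # shape (n,)
```

The same design matrix `X` is reused by both the closed-form and gradient-descent solutions later in these notes.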
Worked Example: Age vs Distance Visible #
A sample of drivers was collected to study the question: how strong is the linear relationship between a driver's age and the distance they can see?
- Predictor (x-axis): Age (years)
- Response (y-axis): Distance visible (ft)
A fitted “line of best fit” from the lecture is:
\[ \text{dist} = -3.0068(\text{Age}) + 576.6819 \]

Interpretation: for every 1-year increase in age, the visibility distance decreases by approximately 3 ft.
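Plugging example ages into the fitted line makes the slope concrete (the ages here are hypothetical, not from the lecture data):

```python
# Prediction using the fitted line from the lecture:
#   dist = -3.0068 * Age + 576.6819
def predicted_distance(age):
    return -3.0068 * age + 576.6819

# Comparing two hypothetical drivers, 40 years apart:
d20 = predicted_distance(20)   # about 516.5 ft
d60 = predicted_distance(60)   # about 396.3 ft
# The 40-year gap costs roughly 40 * 3.0068 ≈ 120 ft of visibility.
```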
Correlation #
Correlation measures linear dependence only. A low correlation does not mean “no relationship”: it may mean the relationship is poorly captured by a linear model.
Also: Being able to fit a line does not necessarily mean the model is good.
Covariance vs correlation #
Covariance:
\[ \mathrm{cov}(x,y)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \]

- sign tells the direction of the relationship
- units depend on the scale of $x$ and $y$ (harder to compare across datasets)
Correlation:
\[ r=\frac{\mathrm{cov}(x,y)}{\sigma_x\sigma_y} \]

- dimensionless and bounded in $[-1,1]$
- easier to interpret and compare
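The two formulas above can be computed directly (a small sketch with made-up near-linear data):

```python
import numpy as np

# Hypothetical data: y is roughly 2x, so r should be close to +1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, 8.2, 9.9])

# Population covariance (the 1/n form used above).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation: covariance divided by the product of standard deviations.
r = cov_xy / (x.std() * y.std())
```

Note that `cov_xy` carries the units of `x` times the units of `y`, while `r` is unitless, which is exactly why correlation is easier to compare across datasets.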
R-squared (R²) #
R² measures how much variance the model explains (compared to predicting the mean):
\[ R^2=1-\frac{SSE}{SST} \]

Where:
- $SSE=\sum (y_i-\hat{y}_i)^2$
- $SST=\sum (y_i-\bar{y})^2$
Interpretation:
- higher R² usually means better fit
- very high R² on training data can indicate overfitting (check validation/test)
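A minimal implementation of the $R^2$ definition above:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST: how much better the model is than predicting the mean."""
    sse = np.sum((y - y_hat) ** 2)          # residual sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - sse / sst

y = np.array([3.0, 5.0, 7.0, 9.0])
perfect = r_squared(y, y)                        # 1.0: all variance explained
baseline = r_squared(y, np.full_like(y, y.mean()))  # 0.0: no better than the mean
```

The two boundary cases make the interpretation concrete: predicting exactly gives $R^2 = 1$, and predicting the mean gives $R^2 = 0$.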
Underfitting vs overfitting (quick diagnosis) #
- high training error and high test error: underfitting (high bias)
- low training error but high test error: overfitting (high variance)
Interpreting a log-transform (quick note) #
If you model $\log(y)$ as linear:
\[ \log(y)=\beta_0+\beta_1 x \]

Then a 1-unit increase in $x$ multiplies $y$ by $e^{\beta_1}$ (approximately a percentage-change effect).
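A quick numerical check of the multiplicative effect, using hypothetical coefficients:

```python
import math

# Hypothetical fitted log-model: log(y) = 1.0 + 0.05 * x
beta_0, beta_1 = 1.0, 0.05

def predict_y(x):
    return math.exp(beta_0 + beta_1 * x)

# A 1-unit increase in x multiplies y by e^{beta_1}:
ratio = predict_y(11) / predict_y(10)
# ratio = e^{0.05} ≈ 1.0513, i.e. roughly a 5% increase per unit of x
```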
Direct Solution Method #
In regression we pick parameters to minimise prediction error. A common choice is the least squares method.
- a mathematical optimisation method for finding the best-fitting line or curve through a set of data points
- it works by minimising the sum of squared residuals (differences between observed and predicted values)
- commonly used in regression analysis to determine the line of best fit
For the full OLS derivation and closed-form solutions:
- /docs/ai/machine-learning/03-ordinary-least-squares/
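The full derivation lives at the link above; as a minimal sketch (with synthetic data generated from a known line), the closed-form solution $w = (X^\top X)^{-1} X^\top y$ can be computed with `np.linalg.lstsq`, which solves the same least-squares problem in a numerically stable way:

```python
import numpy as np

# Synthetic data from a known line y = 2x + 1, plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Design matrix with an intercept column, then solve min_w ||Xw - y||^2.
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
# w should recover roughly [1.0, 2.0] (intercept, slope)
```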
Iterative Method #
Gradient Descent (batch/stochastic/mini-batch) #
When a direct solution is expensive (or you prefer iterative optimisation), you can minimise the cost using gradient descent.
Full gradient descent notes (types, gradients, updates):
- /docs/ai/machine-learning/03-gradient-descent-linear-regression/
Cost function definition:
- /docs/ai/machine-learning/03-cost-function/
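The linked notes cover the details; as a hedged sketch of batch gradient descent on mean squared error (toy data and learning rate chosen for illustration):

```python
import numpy as np

# Toy problem: recover y = 3x + 0.5 from noisy samples.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = 3.0 * x + 0.5 + rng.normal(scale=0.05, size=x.size)

X = np.column_stack([np.ones_like(x), x])   # intercept + feature
w = np.zeros(2)                             # start from zero parameters
lr = 0.5                                    # learning rate (illustrative choice)

for _ in range(2000):
    # Gradient of the mean-squared-error cost (2/n) X^T (Xw - y).
    grad = (2 / len(y)) * X.T @ (X @ w - y)
    w -= lr * grad
# w should converge to roughly [0.5, 3.0]
```

In practice the learning rate and iteration count need tuning; here the feature lies in $[0,1]$, which is also why the note in the checklist about feature scaling matters for gradient descent.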
Linear basis function models #
A key idea in ML is: you can make linear regression more powerful by transforming inputs.
Instead of using $x$ directly, we use a feature mapping $\phi(x)$:
\[ \phi(x) = [\phi_0(x), \phi_1(x), \dots, \phi_M(x)] \]

Then the model becomes:
\[ y \approx \sum_{j=0}^{M} w_j \phi_j(x) \]

Important: this can be non-linear in $x$, but it is still linear in the parameters $w$.
Common basis functions #
Polynomial basis: $\phi(x) = [1, x, x^2, \dots, x^M]$
Radial basis function (Gaussian / RBF): centres $\mu_j$, width $\sigma$
\[ \phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2\sigma^2}\right) \]

Piecewise / spline-like basis: useful when the relationship changes behaviour across different input ranges.
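The polynomial and Gaussian basis functions above can be sketched as feature-mapping functions (centres and widths here are illustrative choices):

```python
import numpy as np

def polynomial_features(x, degree):
    """phi(x) = [1, x, x^2, ..., x^degree] for each scalar input."""
    return np.vstack([x ** j for j in range(degree + 1)]).T

def rbf_features(x, centres, sigma):
    """Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * sigma ** 2))

x = np.linspace(-1, 1, 5)
Phi_poly = polynomial_features(x, degree=3)                        # shape (5, 4)
Phi_rbf = rbf_features(x, np.array([-1.0, 0.0, 1.0]), sigma=0.5)   # shape (5, 3)
```

Either `Phi_poly` or `Phi_rbf` can then be used as the design matrix in ordinary least squares, keeping the model linear in $w$ while making it non-linear in $x$.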
Why this matters #
More basis functions:
- usually reduces bias (more flexibility)
- can increase variance (risk of overfitting)
This leads directly to: bias–variance decomposition.
Bias-variance decomposition #
For squared error, expected prediction error at input $x$ can be decomposed into:
\[ \mathbb{E}\left[(y-\hat{f}(x))^2\right] = \underbrace{\left(\text{Bias}[\hat{f}(x)]\right)^2}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to training data}} + \underbrace{\sigma^2}_{\text{irreducible noise}} \]

Meaning:
- Bias: error from overly simple assumptions (underfitting)
- Variance: error from sensitivity to the particular training sample (overfitting)
- Noise: error you cannot remove (measurement noise, missing variables)
Underfitting vs overfitting #
Underfitting: high bias, low variance
Overfitting: low bias, high variance
Typical pattern: as model complexity increases (e.g., higher-degree polynomial basis), bias tends to go down and variance tends to go up.
Practical diagnosis #
Compare: training error vs test error
- high training error and high test error: likely high bias (model too simple)
- low training error but high test error: likely high variance (model too complex)
Checklist #
When you build a linear regression model, always check:
- Is a linear model shape sensible for this relationship?
- Do residuals show patterns (suggesting non-linearity)?
- Are there outliers strongly influencing the fit?
- Does the model generalise (train vs test performance)?
- Are features scaled (especially for gradient descent)?
- If you use basis functions: are you controlling overfitting (validation, regularisation, early stopping)?
References #
- STAT 501 (Penn State) lesson used in the example
- Tom Mitchell: Machine Learning (textbook reference for core concepts)