Ordinary Least Squares

Direct solution method - Ordinary Least Squares and the Line of Best Fit #

Revision:
OLS is the direct method for linear regression. It finds the best-fit line by minimising the sum of squared residuals without iterative updates.


Direct Method vs Iterative Method ☆ #

Linear regression parameters can be found in two main ways.

MethodMain ideaWhen used
Ordinary Least SquaresCompute the best parameters directlySmall or moderate datasets
Gradient DescentStart with parameters and update repeatedlyLarge datasets or many features
flowchart LR
    A["Linear Regression"] --> B["Direct Solution<br/>OLS"]
    A --> C["Iterative Solution<br/>Gradient Descent"]

    B --> B1["Normal Equation"]
    B --> B2["No learning rate"]
    B --> B3["One-shot solution"]

    C --> C1["Learning rate"]
    C --> C2["Repeated updates"]
    C --> C3["Stops after convergence"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#000
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#000
    style C fill:#FFF9C4,stroke:#b59b3b,color:#000
    style B1 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style B2 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style B3 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C1 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C2 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C3 fill:#EDE7F6,stroke:#8a6fb3,color:#000

Why It Is Called “Least Squares” ☆ #

OLS is called least squares because it chooses parameters that make the squared residual errors as small as possible.

\[ \text{best parameters} = \arg\min_{\beta_0,\beta_1}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 \]

The word least means minimum.

The word squares refers to squared residuals.

It is possible to compute the best parameters for linear regression in one shot (closed-form), instead of iteratively improving them step-by-step.

For linear regression, the direct method is usually Ordinary Least Squares (OLS).

Ordinary Least Squares (OLS) chooses the “best” line by minimising squared prediction errors.

Key takeaway: OLS defines “best fit” as the line that minimises the total squared residual error across all data points.


Predicted value and residual #

For each data point $(x_i, y_i)$, the model predicts:

\[ \hat{y}_i = \beta_0 + \beta_1 x_i \]

The residual (error) is:

\[ r_i = y_i - \hat{y}_i \]

Residual intuition:

  • Positive residual: the point is above the line
  • Negative residual: the point is below the line

Sum of squared errors (SSE) #

Ordinary Least Squares chooses $\beta_0, \beta_1$ to minimise the sum of squared errors:

\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

Why square the residuals:

  • Prevents positive and negative errors cancelling out
  • Penalises large errors more strongly
  • Produces a smooth objective that is easier to optimise

What OLS is minimising #

OLS chooses parameters to minimise the sum of squared residuals:

\[ SSE=\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 \]

This is why OLS is called “least squares”: it penalises large errors more heavily than small ones.


Multicollinearity (perfectly correlated features) #

If two (or more) input features are perfectly correlated, the design matrix becomes rank-deficient: $X^T X$ can become singular.

Practical result: the OLS solution may not be unique (many parameter vectors fit equally well).

What you do in practice:

  • remove/reduce redundant features
  • or use regularisation (e.g. ridge)

Closed-form solution (single predictor) #

When there is a single predictor variable, the solution can be written using covariance and variance:

\[ \beta_1 = \frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)} \]

And the intercept:

\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]

Interpretation:

  • $\beta_1$ is the slope (how much $y$ changes per unit change in $x$)
  • $\beta_0$ is the intercept (predicted value when $x=0$)

OLS numerical (simple linear regression) #

If the question gives a small table of $(x_i, y_i)$ and asks “find the best-fit line / optimal $\theta_0,\theta_1$” (with no learning rate / iteration info), use OLS.

Step-by-step (what you write in the exam) #

  1. Compute means:

First, calculate the mean of x and the mean of y.

\[ \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i, \qquad \bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i \]
  1. Compute the two sums:
\[ S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad S_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \]
  1. Slope and intercept:
\[ \beta_1=\frac{S_{xy}}{S_{xx}}, \qquad \beta_0=\bar{y}-\beta_1\bar{x} \]
  1. Final regression line: $\hat{y}=\beta_0+\beta_1 x$.

Step-by-step (OLS for simple linear regression) - alternate representation #

  1. Identify the given data

    • Independent variable: $X$ (Minutes Studied)
    • Dependent variable: $Y$ (Score)
  2. Calculate the means

    • Compute the mean of $X$: $\,\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$
    • Compute the mean of $Y$: $\,\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$
  3. Determine the slope ($\theta_1$) OLS finds the slope:

\[ \theta_1=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \]

Compute the components:

$x_i$$y_i$$x_i-\bar{x}$$y_i-\bar{y}$$(x_i-\bar{x})(y_i-\bar{y})$$(x_i-\bar{x})^2$
1060-15-11.25168.75225
2065-5-6.2531.2525
307553.7518.7525
40851513.75206.25225
Sum425500

So: $\theta_1=\frac{425}{500}$

  1. Determine the intercept ($\theta_0$) Using the calculated $\theta_1$:
\[ \theta_0=\bar{y}-\theta_1\bar{x} \]

Matrix view (general linear regression framework) #

Matrix calculation is very important for exam numericals can compute using either:

  • direct matrix multiplication, or
  • covariance/variance form (single-feature case).

Design matrix (bias term) #

For linear regression, we include a bias term by setting the first feature $x_0=1$ for every record. fileciteturn34file17

Example (one feature):

  • parameters: $\theta = [\theta_0,\theta_1]^T$
  • design matrix: $X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \dots & \dots \\ 1 & x_n \end{bmatrix}$
  • target vector: $y = [y_1,\dots,y_n]^T$

Normal equation (closed-form) #

\[ \theta^* = (X^T X)^{-1}X^T y \]

Dimension check (quick sanity):

  • $X$: $n\times d$
  • $X^T X$: $d\times d$
  • $(X^T X)^{-1}X^T y$: $d\times 1$ (same shape as $\theta$)

Multicollinearity (perfectly correlated features) #

If two (or more) input features are perfectly correlated, the design matrix becomes rank-deficient: $X^T X$ can become singular.

Practical result: the OLS solution may not be unique (many parameter vectors fit equally well). fileciteturn34file5turn34file0

What you do in practice:

  • remove/reduce redundant features
  • or use regularisation (e.g. ridge)

OLS vs Gradient Descent (how to recognise which one to use) #

Your lecturer explained:

  • gradient descent is iterative and needs a learning rate $\alpha$,
  • closed-form OLS is non-iterative and does not need $\alpha$,
  • for large datasets / many features, closed-form becomes heavy because of matrix inversion, so gradient descent is preferred in practice. fileciteturn34file13turn34file6turn34file2

Use OLS (direct / closed form) when the question includes: #

  • “find the optimal / best-fit $\theta_0,\theta_1$”
  • “using OLS / normal equation / matrix method
  • “compute $\theta=(X^T X)^{-1}X^T y$”
  • no learning rate $\alpha$ and no “iterations” wording fileciteturn34file9turn34file10turn34file13

Use gradient descent (iterative) when the question includes: #

  • initial values (e.g. $\theta_0=0.1,\theta_1=0.5$)
  • learning rate $\alpha$ (or $\eta$)
  • “show only the first iteration / two iterations / update rule”
  • “batch vs stochastic vs mini-batch” fileciteturn34file7turn34file6turn34file19

What to watch out for in numerical problems (quick cues) #

  • If you see $\alpha$ and “one iteration” → it is definitely gradient descent.
  • If you see $(X^T X)^{-1}$ or “normal equation” → it is definitely OLS.
  • If it is a small table and they ask “optimal line” → use OLS unless told “use GD”.
  • If features are large-scale (e.g. $x$ in thousands), GD updates may jump a lot → feature scaling helps. fileciteturn34file14turn34file15

Summary #

Ordinary Least Squares is:

  • Fast and standard for many regression problems
  • Sensitive to outliers (because squaring makes large errors dominate)
  • A foundation for extensions like regularisation (ridge, LASSO)

Notes #

  • OLS is a direct closed-form method.
  • It minimises the sum of squared residuals.
  • It does not require a learning rate.
  • It can be solved using covariance/variance for one feature.
  • It can be solved using the normal equation for matrix form.
  • Matrix inversion can become expensive for large feature spaces.
  • If features are perfectly correlated, the normal equation may fail because \( X^T X \) can become singular.

Revision #

Use OLS when the question asks for the best-fit line directly and does not mention iterations or learning rate.

\[ \theta^* = (X^T X)^{-1}X^T y \]

Summary #

Ordinary Least Squares finds the linear regression parameters that minimise squared error. It is mathematically clean and useful for small or moderate datasets. For very large datasets or high-dimensional data, gradient descent may be preferred.

References #


Home | Machine Learning