Direct solution method - Ordinary Least Squares and the Line of Best Fit #
It is possible to compute the best parameters for linear regression in one shot (closed-form), instead of iteratively improving them step-by-step.
For linear regression, the standard direct method is Ordinary Least Squares (OLS), which chooses the “best” line by minimising squared prediction errors.
Key takeaway: OLS defines “best fit” as the line that minimises the total squared residual error across all data points.
Predicted value and residual #
For each data point $(x_i, y_i)$, the model predicts:
\[ \hat{y}_i = \beta_0 + \beta_1 x_i \]
The residual (error) is:
\[ r_i = y_i - \hat{y}_i \]
Residual intuition:
- Positive residual: the point is above the line
- Negative residual: the point is below the line
Sum of squared errors (SSE) #
Ordinary Least Squares chooses $\beta_0, \beta_1$ to minimise the sum of squared errors:
\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
Why square the residuals:
- Prevents positive and negative errors cancelling out
- Penalises large errors more strongly (this is why OLS is called “least squares”)
- Produces a smooth objective that is easier to optimise
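The residual and SSE definitions above can be sketched in a few lines of Python. This is a minimal illustration with made-up data and an arbitrary candidate line, not part of the original notes:

```python
# Sketch: residuals and SSE for a candidate line (toy data, hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.3, 5.9, 8.2]
b0, b1 = 0.0, 2.0  # candidate intercept and slope (not the OLS optimum)

preds = [b0 + b1 * x for x in xs]                # y_hat_i = b0 + b1 * x_i
residuals = [y - p for y, p in zip(ys, preds)]   # r_i = y_i - y_hat_i
sse = sum(r ** 2 for r in residuals)             # sum of squared errors

print(residuals)  # positive -> point above the line, negative -> below
print(sse)
```

OLS would pick the $(\beta_0, \beta_1)$ pair that makes this `sse` as small as possible over all candidate lines.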
Closed-form solution (single predictor) #
When there is a single predictor variable, the solution can be written using covariance and variance:
\[ \beta_1 = \frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)} \]
And the intercept:
\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]
Interpretation:
- $\beta_1$ is the slope (how much $y$ changes per unit change in $x$)
- $\beta_0$ is the intercept (predicted value when $x=0$)
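The covariance/variance form translates directly into code. A minimal sketch with toy data (the dataset here is invented for illustration):

```python
# Sketch: single-predictor OLS via the covariance/variance form (toy data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# cov(x, y) and var(x); the 1/n factors cancel in the ratio,
# so dividing both by n (or n-1) gives the same slope.
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
var_x = sum((x - x_bar) ** 2 for x in xs) / n

b1 = cov_xy / var_x       # slope
b0 = y_bar - b1 * x_bar   # intercept
print(b1, b0)             # -> slope 0.6, intercept 2.2
```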
OLS numerical (simple linear regression) #
If the question gives a small table of $(x_i, y_i)$ and asks “find the best-fit line / optimal $\theta_0,\theta_1$” (with no learning rate / iteration info), use OLS.
Step-by-step (what you write in the exam) #
- Compute means:
\[ \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i, \qquad \bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i \]
- Compute the two sums:
\[ S_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \qquad S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2 \]
- Slope and intercept:
\[ \beta_1=\frac{S_{xy}}{S_{xx}}, \qquad \beta_0=\bar{y}-\beta_1\bar{x} \]
- Final regression line: $\hat{y}=\beta_0+\beta_1 x$.
Step-by-step (OLS for simple linear regression) - alternate representation #
Identify the given data
- Independent variable: $X$ (Minutes Studied)
- Dependent variable: $Y$ (Score)
Calculate the means
- Compute the mean of $X$: $\,\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$
- Compute the mean of $Y$: $\,\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$
Determine the slope ($\theta_1$)
OLS gives the slope as:
\[ \theta_1=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \]
Compute the components:
| $x_i$ | $y_i$ | $x_i-\bar{x}$ | $y_i-\bar{y}$ | $(x_i-\bar{x})(y_i-\bar{y})$ | $(x_i-\bar{x})^2$ |
|---|---|---|---|---|---|
| 10 | 60 | -15 | -11.25 | 168.75 | 225 |
| 20 | 65 | -5 | -6.25 | 31.25 | 25 |
| 30 | 75 | 5 | 3.75 | 18.75 | 25 |
| 40 | 85 | 15 | 13.75 | 206.25 | 225 |
| Sum | | | | 425 | 500 |

So: $\theta_1=\frac{425}{500}=0.85$
Determine the intercept ($\theta_0$)
Using the calculated $\theta_1$ and the means $\bar{x}=25$, $\bar{y}=71.25$:
\[ \theta_0=\bar{y}-\theta_1\bar{x}=71.25-0.85\times 25=50 \]
Final regression line: $\hat{y}=50+0.85x$.
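The worked table can be checked with a short script. This sketch reproduces the computation from the Minutes Studied / Score data above:

```python
# Sketch: reproduce the worked example (Minutes Studied vs Score).
xs = [10, 20, 30, 40]
ys = [60, 65, 75, 85]

x_bar = sum(xs) / len(xs)   # 25.0
y_bar = sum(ys) / len(ys)   # 71.25

# Numerator and denominator from the table columns
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 425.0
den = sum((x - x_bar) ** 2 for x in xs)                       # 500.0

theta1 = num / den               # 425 / 500 = 0.85
theta0 = y_bar - theta1 * x_bar  # 71.25 - 21.25 = 50.0
print(theta1, theta0)
```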
Matrix view (general linear regression framework) #
Matrix calculation is very important for exam numericals. You can compute the OLS solution using either:
- direct matrix multiplication, or
- covariance/variance form (single-feature case).
Design matrix (bias term) #
For linear regression, we include a bias term by setting the first feature $x_0=1$ for every record.
Example (one feature):
- parameters: $\theta = [\theta_0,\theta_1]^T$
- design matrix: $X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$
- target vector: $y = [y_1,\dots,y_n]^T$
Normal equation (closed-form) #
\[ \theta^* = (X^T X)^{-1}X^T y \]
Dimension check (quick sanity):
- $X$: $n\times d$
- $X^T X$: $d\times d$
- $(X^T X)^{-1}X^T y$: $d\times 1$ (same shape as $\theta$)
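The normal equation is a one-liner with NumPy. This sketch reuses the study-time data from the worked example and should recover the same $\theta$:

```python
# Sketch: normal equation theta* = (X^T X)^{-1} X^T y with NumPy.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([60.0, 65.0, 75.0, 85.0])

# Design matrix: first column of ones is the bias feature x_0 = 1
X = np.column_stack([np.ones_like(x), x])   # shape (n, d) = (4, 2)

theta = np.linalg.inv(X.T @ X) @ X.T @ y    # shape (d,) = (2,)
print(theta)                                # [intercept, slope]
```

In real code you would prefer `np.linalg.lstsq(X, y, rcond=None)` or `np.linalg.solve(X.T @ X, X.T @ y)` over forming the explicit inverse, for numerical stability.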
Multicollinearity (perfectly correlated features) #
If two (or more) input features are perfectly correlated, the design matrix becomes rank-deficient: $X^T X$ can become singular.
Practical result: the OLS solution may not be unique (many parameter vectors fit equally well).
What you do in practice:
- remove/reduce redundant features
- or use regularisation (e.g. ridge)
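The singularity is easy to demonstrate: duplicating a feature (up to a constant factor) drops the rank of $X^T X$, and a ridge penalty restores invertibility. A sketch with invented data and an arbitrary $\lambda$:

```python
# Sketch: a perfectly correlated feature makes X^T X rank-deficient;
# ridge regularisation makes the matrix invertible again.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Third column is exactly 2 * second column -> perfectly correlated features
X = np.column_stack([np.ones_like(x), x, 2 * x])   # shape (4, 3)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # 2, not 3 -> singular, OLS not unique

# Ridge: add lambda * I to X^T X before inverting (lambda chosen arbitrarily here)
lam = 0.1
theta_ridge = np.linalg.inv(XtX + lam * np.eye(3)) @ X.T @ y
print(theta_ridge)                  # now a unique, well-defined solution
```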
OLS vs Gradient Descent (how to recognise which one to use) #
Your lecturer explained:
- gradient descent is iterative and needs a learning rate $\alpha$,
- closed-form OLS is non-iterative and does not need $\alpha$,
- for large datasets / many features, closed-form becomes heavy because of matrix inversion, so gradient descent is preferred in practice.
Use OLS (direct / closed form) when the question includes: #
- “find the optimal / best-fit $\theta_0,\theta_1$”
- “using OLS / normal equation / matrix method”
- “compute $\theta=(X^T X)^{-1}X^T y$”
- no learning rate $\alpha$ and no “iterations” wording
Use gradient descent (iterative) when the question includes: #
- initial values (e.g. $\theta_0=0.1,\theta_1=0.5$)
- learning rate $\alpha$ (or $\eta$)
- “show only the first iteration / two iterations / update rule”
- “batch vs stochastic vs mini-batch”
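For contrast with the closed-form solution, here is one batch gradient-descent update on the same study-time data. The starting values and $\alpha$ are hypothetical, chosen only to illustrate the update rule:

```python
# Sketch: one batch gradient-descent update for the MSE cost
# J = (1/2n) * sum((theta0 + theta1*x_i - y_i)^2). Alpha and the
# initial thetas are hypothetical exam-style values.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [60.0, 65.0, 75.0, 85.0]
theta0, theta1 = 0.0, 0.0
alpha = 0.001
n = len(xs)

# Prediction errors (y_hat - y) at the current parameters
errs = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]

# Partial derivatives of J w.r.t. theta0 and theta1
grad0 = sum(errs) / n
grad1 = sum(e * x for e, x in zip(errs, xs)) / n

# Simultaneous update
theta0 -= alpha * grad0
theta1 -= alpha * grad1
print(theta0, theta1)
```

Unlike the normal equation, this only moves one small step toward the optimum; many iterations would be needed to approach $\theta_0=50$, $\theta_1=0.85$.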
What to watch out for in numerical problems (quick cues) #
- If you see $\alpha$ and “one iteration” → it is definitely gradient descent.
- If you see $(X^T X)^{-1}$ or “normal equation” → it is definitely OLS.
- If it is a small table and they ask “optimal line” → use OLS unless told “use GD”.
- If features are large-scale (e.g. $x$ in thousands), GD updates may jump a lot → feature scaling helps.
Summary #
Ordinary Least Squares is:
- Fast and standard for many regression problems
- Sensitive to outliers (because squaring makes large errors dominate)
- A foundation for extensions like regularisation (ridge, LASSO)