Direct solution method - Ordinary Least Squares and the Line of Best Fit #
It is possible to compute the best parameters for linear regression in one shot (closed-form), instead of iteratively improving them step by step.
For linear regression, the direct method is usually Ordinary Least Squares (OLS).
Ordinary Least Squares (OLS) chooses the “best” line by minimising squared prediction errors.
Key takeaway: OLS defines “best fit” as the line that minimises the total squared residual error across all data points.
Predicted value and residual #
For each data point $(x_i, y_i)$, the model predicts:
\[ \hat{y}_i = \beta_0 + \beta_1 x_i \]

The residual (error) is:

\[ r_i = y_i - \hat{y}_i \]

Residual intuition:
- Positive residual: the point is above the line
- Negative residual: the point is below the line
Sum of squared errors (SSE) #
Ordinary Least Squares chooses $\beta_0, \beta_1$ to minimise the sum of squared errors:
\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

Why square the residuals:
- prevents positive and negative errors cancelling out
- penalises large errors more strongly
- produces a smooth objective that is easier to optimise
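The predictions, residuals, and SSE can be sketched in NumPy as follows; the data points and the candidate line coefficients are made up for illustration:

```python
import numpy as np

# Toy data (assumed values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# A candidate line y_hat = b0 + b1*x (assumed coefficients)
b0, b1 = 0.0, 2.0
y_hat = b0 + b1 * x           # predictions
residuals = y - y_hat         # positive -> point above the line
sse = np.sum(residuals ** 2)  # sum of squared errors
```

OLS searches over all $(b_0, b_1)$ for the pair that makes this `sse` as small as possible.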
Closed-form solution (single predictor) #
When there is a single predictor variable, the solution can be written using covariance and variance:
\[ \beta_1 = \frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)} \]

And the intercept:

\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]

Interpretation:
- $\beta_1$ is the slope (how much $y$ changes per unit change in $x$)
- $\beta_0$ is the intercept (predicted value when $x=0$)
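A minimal sketch of the covariance/variance formulas, using assumed toy data where $y = 2x$ exactly (so the recovered slope should be 2 and the intercept 0):

```python
import numpy as np

# Assumed toy data: y = 2x exactly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Sample covariance and variance (the ddof choice cancels in the ratio)
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
# beta1 -> 2.0, beta0 -> 0.0
```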
Exam template: solve an OLS numerical (simple linear regression) #
If the question gives a small table of $(x_i, y_i)$ and asks “find the best-fit line / optimal $\theta_0,\theta_1$” (with no learning rate / iteration info), use OLS.
Step-by-step (what you write in the exam) #
- Compute means: $\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y}=\frac{1}{n}\sum_{i=1}^{n} y_i$
- Compute the two sums: $S_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ and $S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2$
- Slope and intercept: $\beta_1 = S_{xy}/S_{xx}$ and $\beta_0 = \bar{y}-\beta_1\bar{x}$
- Final regression line: $\hat{y}=\beta_0+\beta_1 x$.
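The template can be checked on a small assumed table (three points chosen so the arithmetic stays simple):

```python
import numpy as np

# Exam-style table (assumed values)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0])

x_bar, y_bar = x.mean(), y.mean()          # step 1: means
s_xy = np.sum((x - x_bar) * (y - y_bar))   # step 2: the two sums
s_xx = np.sum((x - x_bar) ** 2)
beta1 = s_xy / s_xx                        # step 3: slope
beta0 = y_bar - beta1 * x_bar              #         intercept
# beta1 = 0.5, beta0 = 2/3, so y_hat = 0.667 + 0.5 x
```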
Matrix view (general linear regression framework) #
Your lecturer emphasised that the matrix calculation is very important for exam numericals and that you can compute using either:
- direct matrix multiplication, or
- covariance/variance form (single-feature case).
Design matrix (bias term) #
For linear regression, we include a bias term by setting the first feature $x_0=1$ for every record.
Example (one feature):
- parameters: $\theta = [\theta_0,\theta_1]^T$
- design matrix: $X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$
- target vector: $y = [y_1,\dots,y_n]^T$
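Building this one-feature design matrix in NumPy (assumed toy values for $x$) can be sketched as:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
# Prepend a column of ones: the bias feature x0 = 1 for every record
X = np.column_stack([np.ones_like(x), x])
# X.shape -> (3, 2): one row per record, one column per parameter
```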
Normal equation (closed-form) #
\[ \theta^* = (X^T X)^{-1}X^T y \]

Dimension check (quick sanity):
- $X$: $n\times d$
- $X^T X$: $d\times d$
- $(X^T X)^{-1}X^T y$: $d\times 1$ (same shape as $\theta$)
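A sketch of the normal equation on assumed data generated exactly from $y = 1 + 2x$; `np.linalg.solve` is used instead of an explicit inverse, which is the numerically safer way to evaluate $(X^T X)^{-1} X^T y$:

```python
import numpy as np

# Assumed data: y = 1 + 2x exactly
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Solve (X^T X) theta = X^T y rather than inverting X^T X explicitly
theta = np.linalg.solve(X.T @ X, X.T @ y)
# theta -> [1.0, 2.0], shape (2,): d x 1, as in the dimension check
```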
Multicollinearity (perfectly correlated features) #
If two (or more) input features are perfectly correlated, the design matrix is rank-deficient, so $X^T X$ is singular (non-invertible).
Practical result: the OLS solution may not be unique (many parameter vectors fit equally well).
What you do in practice:
- remove/reduce redundant features
- or use regularisation (e.g. ridge)
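A quick demonstration, with an assumed perfectly duplicated feature, that $X^T X$ loses rank, and that a small ridge penalty makes the system solvable again:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# Third column is perfectly correlated with the second (exactly 2x)
X = np.column_stack([np.ones_like(x), x, 2 * x])
y = 1.0 + 3.0 * x

rank = np.linalg.matrix_rank(X.T @ X)  # 2, not 3: X^T X is singular

# Ridge: add lam * I before solving, which restores invertibility
lam = 0.1
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```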
OLS vs Gradient Descent (how to recognise which one to use) #
Your lecturer explained:
- gradient descent is iterative and needs a learning rate $\alpha$,
- closed-form OLS is non-iterative and does not need $\alpha$,
- for large datasets / many features, closed-form becomes heavy because of matrix inversion, so gradient descent is preferred in practice.
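A side-by-side sketch on assumed data: the iterative loop needs a starting point and a learning rate $\alpha$, while the closed form needs neither; both should land on the same parameters:

```python
import numpy as np

# Assumed data: y = 1 + 2x exactly
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([3.0, 5.0, 7.0, 9.0])
n = len(y)

# Gradient descent: iterative, needs an initial theta and alpha
theta = np.zeros(2)
alpha = 0.05
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ theta - y)  # gradient of mean squared error
    theta -= alpha * grad

# Closed-form OLS: one shot, no alpha, no iterations
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)
# both converge to [1.0, 2.0]
```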
Use OLS (direct / closed form) when the question includes: #
- “find the optimal / best-fit $\theta_0,\theta_1$”
- “using OLS / normal equation / matrix method”
- “compute $\theta=(X^T X)^{-1}X^T y$”
- no learning rate $\alpha$ and no “iterations” wording
Use gradient descent (iterative) when the question includes: #
- initial values (e.g. $\theta_0=0.1,\theta_1=0.5$)
- learning rate $\alpha$ (or $\eta$)
- “show only the first iteration / two iterations / update rule”
- “batch vs stochastic vs mini-batch”
What to watch out for in numerical problems (quick cues) #
- If you see $\alpha$ and “one iteration” → it is definitely gradient descent.
- If you see $(X^T X)^{-1}$ or “normal equation” → it is definitely OLS.
- If it is a small table and they ask “optimal line” → use OLS unless told “use GD”.
- If features are large-scale (e.g. $x$ in thousands), GD updates may jump a lot → feature scaling helps.
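A sketch of one common scaling choice, standardisation, for a large-scale feature (assumed values in the thousands); after scaling, the feature has zero mean and unit variance, which keeps GD step sizes comparable across features:

```python
import numpy as np

x = np.array([1000.0, 2000.0, 3000.0, 4000.0])  # large-scale feature
# Standardise: subtract the mean, divide by the standard deviation
x_scaled = (x - x.mean()) / x.std()
# x_scaled -> [-1.342, -0.447, 0.447, 1.342]
```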