Cost Function

Cost Function #

  • also known as an objective function

  • quantifies the error between a model’s predicted values and the actual values

  • measures the model’s error over a group of data points, not just a single example

  • used to evaluate the accuracy of a model’s predictions

  • guides training: the model’s parameters are adjusted in order to minimise the difference between predicted and actual values

  • in linear regression, minimising the cost is what produces the best-fit line through the data

flowchart TD
CF["Cost<br/>function"] -->|measures| ERR["Prediction<br/>error"]
CF -->|guides| OPT["Optimisation<br/>(training)"]

CF -->|includes| DATA["Data<br/>fit"]
CF -->|may include| PEN["Penalty<br/>(regularisation)"]

style CF fill:#90CAF9,stroke:#1E88E5,color:#000

style ERR fill:#CE93D8,stroke:#8E24AA,color:#000
style OPT fill:#CE93D8,stroke:#8E24AA,color:#000

style DATA fill:#C8E6C9,stroke:#2E7D32,color:#000
style PEN fill:#C8E6C9,stroke:#2E7D32,color:#000

Key takeaway: In ML, we choose model parameters to minimise a cost $J$. For linear regression, the most common choice is the squared error cost.


Training set and model #

You have a training set with:

  • input features $x$
  • output targets $y$

The linear regression model is:

\[ f_{w,b}(x)=wx+b \]

The values $w$ and $b$ are the parameters of the model. You adjust them during training to improve the model.

You may also hear:

  • $w,b$ called coefficients
  • $w,b$ called weights

What w and b do #

Different values of $w$ and $b$ give different straight lines.

  • $b$ is the y-intercept: the value of the prediction when $x=0$
  • $w$ is the slope: how much the prediction changes when $x$ increases by one unit

Examples:

  • If $w=0$ and $b=1.5$: the model predicts a constant $1.5$ (horizontal line)
  • If $w=0.5$ and $b=0$: the line passes through the origin and has slope $0.5$
  • If $w=0.5$ and $b=1$: same slope, shifted up by 1
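The three examples above can be checked numerically. A minimal sketch (the test input `2.0` is an arbitrary choice):

```python
def f_wb(x, w, b):
    """Linear model prediction: f_{w,b}(x) = w*x + b."""
    return w * x + b

# w=0, b=1.5: a horizontal line, so every prediction is 1.5
print(f_wb(2.0, w=0.0, b=1.5))  # 1.5
# w=0.5, b=0: line through the origin with slope 0.5
print(f_wb(2.0, w=0.5, b=0.0))  # 1.0
# w=0.5, b=1: same slope, shifted up by 1
print(f_wb(2.0, w=0.5, b=1.0))  # 2.0
```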

Predictions on training examples #

A training example is written as: $(x^{(i)},y^{(i)})$

For input $x^{(i)}$, the model predicts:

\[ \hat{y}^{(i)} = f_{w,b}\!\left(x^{(i)}\right)=wx^{(i)}+b \]

$f_{w,b}\!\left(x^{(i)}\right)$ : our prediction for example $i$ using parameters $w,b$

Goal: choose $w$ and $b$ so that $\hat{y}^{(i)}$ is close to $y^{(i)}$ for many (ideally all) training examples.


Computing Cost #

The equation for the cost with one variable is:

\[ J(w,b)=\frac{1}{2m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)^2 \]

$\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)^2$ : the squared difference between the prediction and the target value

The squared differences are summed over all $m$ examples and divided by $2m$ to produce the cost $J(w,b)$. The extra factor of $2$ is a convention that makes the derivative of $J$ cleaner.
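The cost formula translates directly into code. A minimal NumPy sketch with hypothetical training data (`compute_cost` is an illustrative name, not one used in the notes):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w,b) = (1/2m) * sum((w*x_i + b - y_i)^2)."""
    m = x.shape[0]
    errors = (w * x + b) - y  # prediction minus target, per example
    return np.sum(errors ** 2) / (2 * m)

# Hypothetical data lying exactly on y = 2x, so the cost at w=2, b=0 is zero
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 6.0])
print(compute_cost(x_train, y_train, w=2.0, b=0.0))  # 0.0
print(compute_cost(x_train, y_train, w=1.0, b=0.0))  # (1+4+9)/6 ≈ 2.333
```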



Intuition: error and squared error #

For one example $i$, the error is:

\[ \hat{y}^{(i)}-y^{(i)} \]

Squared error for example $i$:

\[ \left(\hat{y}^{(i)}-y^{(i)}\right)^2 \]

The cost function sums the squared errors over the dataset and divides by $2m$: an average, with an extra factor of $2$ for mathematical convenience.


Squared Error & Mean Squared Error (MSE) #

| Feature | Squared Error | Mean Squared Error (MSE) |
| --- | --- | --- |
| Scope | Individual data point (residual) | Entire dataset |
| Formula | \( (y_i - \hat{y}_i)^2 \) | \( \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \) |
| Output | One value per observation | One single value for the model |
| Purpose | Measures individual error | Measures overall model performance |
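The distinction is easy to see in code: squared error is one value per point, while MSE collapses them to a single scalar. A small sketch with made-up numbers:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0])  # actual values (hypothetical)
y_hat = np.array([2.5, 5.0, 8.0])  # predicted values (hypothetical)

squared_errors = (y - y_hat) ** 2  # one value per observation
mse = squared_errors.mean()        # one single value for the model

print(squared_errors)        # [0.25 0.   1.  ]
print(round(float(mse), 4))  # 0.4167
```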

Cost surface visualisation #

Simplified visualisation: set b=0 #

To build intuition, sometimes we simplify the model by setting $b=0$:

\[ f_w(x)=wx \]

Now the cost depends on one parameter:

\[ J(w)=\frac{1}{2m}\sum_{i=0}^{m-1}\left(wx^{(i)}-y^{(i)}\right)^2 \]

Plotting $J(w)$ versus $w$ gives a U-shaped curve (“bowl”).
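The U-shape can be reproduced by sweeping $w$ and evaluating the cost. A sketch using hypothetical data that fits $y = x$ exactly, so the minimum sits at $w = 1$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # hypothetical inputs
y = np.array([1.0, 2.0, 3.0])  # targets on the line y = x
m = x.shape[0]

def J(w):
    """Cost with b fixed at 0: J(w) = (1/2m) * sum((w*x_i - y_i)^2)."""
    return np.sum((w * x - y) ** 2) / (2 * m)

for w in np.linspace(0.0, 2.0, 5):
    print(w, round(float(J(w)), 4))
# The cost falls and then rises again: the U-shaped bowl, lowest at w = 1
```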

Full visualisation: $J(w,b)$ as a surface #

With both parameters $w$ and $b$:

  • $J(w,b)$ becomes a 3D surface (bowl / hammock shape)
  • each point $(w,b)$ corresponds to a single value of $J$

Contour plot view #

A contour plot is a 2D way to visualise the same 3D surface:

  • x-axis: $w$
  • y-axis: $b$
  • each contour (oval) shows points with the same cost $J$

The centre of the smallest oval is the minimum-cost point.


Types #

flowchart TD
T["Cost function<br/>types"] --> REG["Regression"]
T --> CLS["Classification"]
T --> PROB["Probabilistic<br/>models"]
T --> REGZ["Regularisation"]

REG --> MSE["MSE"]
REG --> MAE["MAE"]
REG --> HUB["Huber"]

CLS --> CE["Cross-entropy<br/>(log loss)"]
CLS --> HNG["Hinge"]

PROB --> NLL["Negative<br/>log-likelihood"]

REGZ --> L2["L2 (Ridge)"]
REGZ --> L1["L1 (Lasso)"]
REGZ --> EN["Elastic<br/>Net"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style REG fill:#C8E6C9,stroke:#2E7D32,color:#000
style CLS fill:#C8E6C9,stroke:#2E7D32,color:#000
style PROB fill:#C8E6C9,stroke:#2E7D32,color:#000
style REGZ fill:#C8E6C9,stroke:#2E7D32,color:#000

style MSE fill:#CE93D8,stroke:#8E24AA,color:#000
style MAE fill:#CE93D8,stroke:#8E24AA,color:#000
style HUB fill:#CE93D8,stroke:#8E24AA,color:#000
style CE fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000
style NLL fill:#CE93D8,stroke:#8E24AA,color:#000
style L2 fill:#CE93D8,stroke:#8E24AA,color:#000
style L1 fill:#CE93D8,stroke:#8E24AA,color:#000
style EN fill:#CE93D8,stroke:#8E24AA,color:#000


MSE #

measures the average of squared residuals in the dataset

MAE #

measures the average absolute error in the dataset

RMSE #

measures the square root of the MSE; it expresses the typical size of the residuals in the same units as the target
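The three regression metrics above differ only in how they aggregate the residuals. A sketch with hypothetical targets and predictions:

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared residuals."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean of the absolute residuals."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Square root of the MSE, in the same units as the target."""
    return np.sqrt(mse(y, y_hat))

y     = np.array([1.0, 2.0, 3.0])  # hypothetical targets
y_hat = np.array([1.0, 2.5, 2.0])  # hypothetical predictions

print(round(float(mse(y, y_hat)), 4))   # 0.4167
print(round(float(mae(y, y_hat)), 4))   # 0.5
print(round(float(rmse(y, y_hat)), 4))  # 0.6455
```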

Cross-Entropy (Log Loss) #

Used mainly in classification, especially Logistic Regression.

It measures the difference between the predicted probability and the true class label.

Per-example cost:

\[ \mathrm{Cost}\!\left(h_\theta(x),y\right) = -y\log\!\left(h_\theta(x)\right) -(1-y)\log\!\left(1-h_\theta(x)\right) \]

Cost over the full training set:

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log\!\left(h_\theta\!\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\!\left(1-h_\theta\!\left(x^{(i)}\right)\right) \right] \]

Why it is used:

  • works well for probabilistic classification
  • penalises confident wrong predictions heavily
  • gives a convex optimisation objective for Logistic Regression
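The heavy penalty for confident mistakes is easy to demonstrate. A sketch comparing two hypothetical sets of predicted probabilities for the same labels (this `log_loss` is a hand-rolled illustration, not a library function):

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Cross-entropy J = -(1/m) * sum(y*log(p) + (1-y)*log(1-p))."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])        # true labels (hypothetical)
p_good = np.array([0.9, 0.1, 0.8])   # confident and mostly right
p_bad  = np.array([0.1, 0.9, 0.2])   # confident and mostly wrong

print(round(float(log_loss(y, p_good)), 4))  # small cost
print(round(float(log_loss(y, p_bad)), 4))   # much larger cost
```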

Linear vs Logistic (5-Step Comparison) #

Model → Predict → Cost → Gradient → Update

| Step | Linear Regression (MSE / Squared Error) | Logistic Regression (Log Loss / Cross-Entropy) |
| --- | --- | --- |
| 1) Model | \( \hat{y}=wx+b \) | \( z=w\cdot x+b,\quad p=\sigma(z)=\frac{1}{1+e^{-z}} \) |
| 2) Predict | \( \hat{y}^{(i)}=wx^{(i)}+b \) | \( z^{(i)}=w\cdot x^{(i)}+b,\quad p^{(i)}=\sigma(z^{(i)}) \) |
| 3) Cost | \( J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2 \) | \( J(w,b)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log p^{(i)}-(1-y^{(i)})\log(1-p^{(i)})\right] \) |
| 4) Gradients | \( \frac{\partial J}{\partial w}=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)x^{(i)},\quad \frac{\partial J}{\partial b}=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right) \) | \( \frac{\partial J}{\partial w}=\frac{1}{m}\sum_{i=1}^{m}\left(p^{(i)}-y^{(i)}\right)x^{(i)},\quad \frac{\partial J}{\partial b}=\frac{1}{m}\sum_{i=1}^{m}\left(p^{(i)}-y^{(i)}\right) \) |
| 5) Update | \( w:=w-\alpha\frac{\partial J}{\partial w},\quad b:=b-\alpha\frac{\partial J}{\partial b} \) | \( w:=w-\alpha\frac{\partial J}{\partial w},\quad b:=b-\alpha\frac{\partial J}{\partial b} \) |
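The five logistic-regression steps can be strung into a training loop. A sketch on hypothetical 1-D data where class 1 corresponds to larger $x$ (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical separable data: negative x -> class 0, positive x -> class 1
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b, alpha = 0.0, 0.0, 0.5
for _ in range(1000):
    p = sigmoid(w * x + b)     # steps 1-2: model and predict
    dw = np.mean((p - y) * x)  # step 4: gradient w.r.t. w
    db = np.mean(p - y)        # step 4: gradient w.r.t. b
    w -= alpha * dw            # step 5: update
    b -= alpha * db

print(sigmoid(w * 2.0 + b) > 0.5)   # True: x = 2 classified as class 1
print(sigmoid(w * -2.0 + b) < 0.5)  # True: x = -2 classified as class 0
```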

Loss Function vs Cost Function #

Loss function:

  • defined on a single training example
  • measures how well the model performs on one example

Cost function:

  • aggregates loss over the whole training set
  • measures how well the model performs across the dataset

Training objective #

Training means: choose parameters that minimise the cost on the training data.

For regression → cost is often squared error. For classification → a common cost is cross-entropy (log loss).

Role of Gradient Descent in Updating the Weights #

Gradient Descent is an optimisation algorithm used to minimise the cost function and find the best-fit line for the model.

  • iteratively adjusts the model’s weights to reduce the error
  • each iteration moves the weights in the direction that decreases the cost function, converging toward the optimal set of parameters
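Putting the pieces together, the update rule can recover known parameters from data. A sketch on hypothetical data generated from $y = 2x + 1$ (learning rate and iteration count chosen arbitrarily):

```python
import numpy as np

# Hypothetical data on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b, alpha = 0.0, 0.0, 0.1
for _ in range(2000):
    err = (w * x + b) - y          # prediction error per example
    w -= alpha * np.mean(err * x)  # dJ/dw = (1/m) * sum(err * x)
    b -= alpha * np.mean(err)      # dJ/db = (1/m) * sum(err)

print(round(w, 3), round(b, 3))  # approximately 2.0 and 1.0
```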
