Cost Function #

  • also known as an objective function

  • quantifies the error between a model’s predicted values and the actual values

  • measures the model’s error over a set of data points

  • used to evaluate the accuracy of a model’s predictions

  • guides the adjustment of the model’s parameters to minimise the difference between predicted and actual values

flowchart TD
CF["Cost<br/>function"] -->|measures| ERR["Prediction<br/>error"]
CF -->|guides| OPT["Optimisation<br/>(training)"]

CF -->|includes| DATA["Data<br/>fit"]
CF -->|may include| PEN["Penalty<br/>(regularisation)"]

style CF fill:#90CAF9,stroke:#1E88E5,color:#000

style ERR fill:#CE93D8,stroke:#8E24AA,color:#000
style OPT fill:#CE93D8,stroke:#8E24AA,color:#000

style DATA fill:#C8E6C9,stroke:#2E7D32,color:#000
style PEN fill:#C8E6C9,stroke:#2E7D32,color:#000

Key takeaway: In ML, we choose model parameters to minimise a cost $J$. For linear regression, the most common choice is the squared error cost.


Training set and model #

You have a training set with:

  • input features $x$
  • output targets $y$

The linear regression model is:

\[ f_{w,b}(x)=wx+b \]

The values $w$ and $b$ are the parameters of the model. You adjust them during training to improve the model.

You may also hear:

  • $w,b$ called coefficients
  • $w,b$ called weights

What w and b do #

Different values of $w$ and $b$ give different straight lines.

  • $b$ is the y-intercept: the value of the prediction when $x=0$
  • $w$ is the slope: how much the prediction changes when $x$ increases by 1

Examples:

  • If $w=0$ and $b=1.5$: the model predicts a constant $1.5$ (horizontal line)
  • If $w=0.5$ and $b=0$: the line passes through the origin and has slope $0.5$
  • If $w=0.5$ and $b=1$: same slope, shifted up by 1

Predictions on training examples #

A training example is written as: $(x^{(i)},y^{(i)})$

For input $x^{(i)}$, the model predicts:

\[ \hat{y}^{(i)} = f_{w,b}\!\left(x^{(i)}\right)=wx^{(i)}+b \]

$f_{w,b}\!\left(x^{(i)}\right)$ : our prediction for example $i$ using parameters $w,b$

Goal: choose $w$ and $b$ so that $\hat{y}^{(i)}$ is close to $y^{(i)}$ for many (ideally all) training examples.
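As a sketch, the prediction step can be written directly in Python; the function name `predict` and the training values below are invented for illustration:

```python
def predict(x, w, b):
    """Model prediction: f_{w,b}(x) = w*x + b."""
    return w * x + b

# Hypothetical training set (x^(i), y^(i)); values are illustrative only.
x_train = [1.0, 2.0, 3.0]
y_train = [300.0, 500.0, 700.0]

w, b = 200.0, 100.0
y_hat = [predict(x, w, b) for x in x_train]
print(y_hat)  # [300.0, 500.0, 700.0] -- equal to y_train, so these parameters fit perfectly
```

Here every $\hat{y}^{(i)}$ matches $y^{(i)}$, which is exactly the situation the cost function (next section) rewards with a low value.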


Computing Cost #

Equation for the cost with one variable:

\[ J(w,b)=\frac{1}{2m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)^2 \]

$\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)^2$ : the squared difference between the prediction and the target value

The squared differences are summed over all $m$ examples and divided by $2m$ to produce the cost $J(w,b)$. Dividing by $2$ as well as $m$ is a convention: it cancels the factor of $2$ that appears when differentiating the square, which simplifies the gradient descent update.
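A direct Python translation of this formula might look like the following; the helper name `compute_cost` and the sample data are our own choices:

```python
def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = (1/2m) * sum_i (f_{w,b}(x^(i)) - y^(i))^2."""
    m = len(x)
    total = 0.0
    for i in range(m):
        f_wb = w * x[i] + b          # prediction for example i
        total += (f_wb - y[i]) ** 2  # squared difference
    return total / (2 * m)

# Illustrative data lying exactly on y = 2x + 1
x_train = [1.0, 2.0, 3.0]
y_train = [3.0, 5.0, 7.0]

print(compute_cost(x_train, y_train, 2.0, 1.0))  # 0.0 -- perfect fit
print(compute_cost(x_train, y_train, 1.0, 1.0))  # errors -1,-2,-3 -> (1+4+9)/6 ≈ 2.333
```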


Intuition: error and squared error #

For one example $i$, the error is:

\[ \hat{y}^{(i)}-y^{(i)} \]

Squared error for example $i$:

\[ \left(\hat{y}^{(i)}-y^{(i)}\right)^2 \]

The cost function sums the squared errors over the dataset and divides the total by $2m$.


Squared Error & Mean Squared Error (MSE) #

| Feature | Squared Error | Mean Squared Error (MSE) |
|---------|---------------|--------------------------|
| Scope   | Individual data point (residual) | Entire dataset |
| Formula | \( (y_i - \hat{y}_i)^2 \) | \( \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \) |
| Output  | One value per observation | One single value for the model |
| Purpose | Measures individual error | Measures overall model performance |
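The distinction in the table can be shown with a few made-up numbers:

```python
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

# Squared error: one value per observation
squared_errors = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]
print(squared_errors)  # [0.25, 0.0, 1.0]

# MSE: a single value summarising the whole dataset
mse = sum(squared_errors) / len(squared_errors)
print(mse)  # 0.4166...
```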

Cost surface visualisation #

Simplified visualisation: set b=0 #

To build intuition, sometimes we simplify the model by setting $b=0$:

\[ f_w(x)=wx \]

Now the cost depends on one parameter:

\[ J(w)=\frac{1}{2m}\sum_{i=0}^{m-1}\left(wx^{(i)}-y^{(i)}\right)^2 \]

Plotting $J(w)$ versus $w$ gives a U-shaped curve (“bowl”).
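Sweeping $w$ over a few values shows the U shape numerically; this sketch uses invented data lying on $y=2x$, so the minimum should fall at $w=2$:

```python
def cost_b0(x, y, w):
    """J(w) with b fixed at 0: (1/2m) * sum (w*x^(i) - y^(i))^2."""
    m = len(x)
    return sum((w * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

# Data lying on y = 2x, so the bowl's minimum should be at w = 2
x_train = [1.0, 2.0, 3.0]
y_train = [2.0, 4.0, 6.0]

ws = [0.0, 1.0, 2.0, 3.0, 4.0]
costs = [cost_b0(x_train, y_train, w) for w in ws]
print(costs)   # cost falls to 0 at w = 2, then rises again (the U shape)
best_w = ws[costs.index(min(costs))]
print(best_w)  # 2.0
```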

Full visualisation: $J(w,b)$ as a surface #

With both parameters $w$ and $b$:

  • $J(w,b)$ becomes a 3D surface (bowl / hammock shape)
  • each point $(w,b)$ corresponds to a single value of $J$

Contour plot view #

A contour plot is a 2D way to visualise the same 3D surface:

  • x-axis: $w$
  • y-axis: $b$
  • each contour (oval) shows points with the same cost $J$

The centre of the smallest oval is the point of minimum cost.
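A coarse grid search over $(w,b)$ mimics reading the contour plot by eye; the data below are invented so that the minimum sits exactly at $(w,b)=(0.5,1)$:

```python
def cost(x, y, w, b):
    """Squared error cost over the dataset for parameters (w, b)."""
    m = len(x)
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

# Data on y = 0.5x + 1, so the minimum of the surface is at (w, b) = (0.5, 1)
x_train = [0.0, 2.0, 4.0]
y_train = [1.0, 2.0, 3.0]

# Coarse grid over (w, b); each grid cell is one point on the cost surface
ws = [0.0, 0.25, 0.5, 0.75, 1.0]
bs = [0.0, 0.5, 1.0, 1.5, 2.0]
best = min((cost(x_train, y_train, w, b), w, b) for w in ws for b in bs)
print(best)  # (0.0, 0.5, 1.0) -- the centre of the smallest contour oval
```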


Types #

flowchart TD
T["Cost function<br/>types"] --> REG["Regression"]
T --> CLS["Classification"]
T --> PROB["Probabilistic<br/>models"]
T --> REGZ["Regularisation"]

REG --> MSE["MSE"]
REG --> MAE["MAE"]
REG --> HUB["Huber"]

CLS --> CE["Cross-entropy<br/>(log loss)"]
CLS --> HNG["Hinge"]

PROB --> NLL["Negative<br/>log-likelihood"]

REGZ --> L2["L2 (Ridge)"]
REGZ --> L1["L1 (Lasso)"]
REGZ --> EN["Elastic<br/>Net"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style REG fill:#C8E6C9,stroke:#2E7D32,color:#000
style CLS fill:#C8E6C9,stroke:#2E7D32,color:#000
style PROB fill:#C8E6C9,stroke:#2E7D32,color:#000
style REGZ fill:#C8E6C9,stroke:#2E7D32,color:#000

style MSE fill:#CE93D8,stroke:#8E24AA,color:#000
style MAE fill:#CE93D8,stroke:#8E24AA,color:#000
style HUB fill:#CE93D8,stroke:#8E24AA,color:#000
style CE fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000
style NLL fill:#CE93D8,stroke:#8E24AA,color:#000
style L2 fill:#CE93D8,stroke:#8E24AA,color:#000
style L1 fill:#CE93D8,stroke:#8E24AA,color:#000
style EN fill:#CE93D8,stroke:#8E24AA,color:#000

MSE #

measures the average of squared residuals in the dataset

MAE #

measures the average absolute error in the dataset

RMSE #

measures the square root of the MSE; because it is in the same units as the target, it approximates the standard deviation of the residuals
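The three metrics can be computed side by side; the numbers below are purely illustrative:

```python
import math

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]  # [1.0, 0.0, -2.0]

mse = sum(r ** 2 for r in residuals) / len(residuals)    # (1 + 0 + 4) / 3
mae = sum(abs(r) for r in residuals) / len(residuals)    # (1 + 0 + 2) / 3
rmse = math.sqrt(mse)                                    # back in the units of y

print(mse, mae, rmse)  # 1.666..., 1.0, 1.29...
```

Note that MSE penalises the large residual ($-2$) much more heavily than MAE does, which is why the choice between them matters when outliers are present.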


Loss Function vs Cost Function #

Loss function:

  • defined on a single training example
  • measures how well the model performs on one example

Cost function:

  • aggregates loss over the whole training set
  • measures how well the model performs across the dataset

Role of Gradient Descent in Updating the Weights #

Gradient Descent is an optimisation algorithm used to minimise the cost function and find the best-fit line for the model.

  • iteratively adjusts the weights of the model to reduce the error
  • each iteration updates the weights in the direction that decreases the cost function, moving towards the optimal set of parameters
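The update step above can be sketched for the squared error cost; the learning rate `alpha`, the iteration count, and the data are our own choices, not from the text:

```python
def gradient_descent_step(x, y, w, b, alpha):
    """One update of (w, b) using the gradients of the squared error cost."""
    m = len(x)
    dj_dw = sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
    dj_db = sum((w * xi + b - yi) for xi, yi in zip(x, y)) / m
    return w - alpha * dj_dw, b - alpha * dj_db

# Data on y = 2x + 1; gradient descent should move (w, b) towards (2, 1)
x_train = [1.0, 2.0, 3.0]
y_train = [3.0, 5.0, 7.0]

w, b = 0.0, 0.0
for _ in range(10000):
    w, b = gradient_descent_step(x_train, y_train, w, b, alpha=0.01)
print(round(w, 2), round(b, 2))  # approximately 2.0 1.0
```

Each step moves the parameters a little way downhill on the cost surface, which is exactly the bowl-descending picture from the visualisation sections above.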

References #

  • Cost Function
  • /docs/ai/machine-learning/03-gradient-descent-linear-regression/
  • /docs/ai/machine-learning/03-linear-models-regression/