
Instance-based Learning #

Instance-based learning is a family of methods that do not build one explicit global model during training. Instead, they store training examples and delay most of the work until a new query arrives.

When a new point must be classified or predicted, the algorithm compares it with previously seen examples, finds the most relevant neighbours, and uses them to produce the answer.

Instance-based Learning covers three linked ideas:

  • $k$-Nearest Neighbour (KNN) learning
  • Locally Weighted Regression (LWR)
  • Radial Basis Functions (RBF)

Key takeaway: Instance-based learning is a lazy learning approach. Training is usually cheap because we mainly store the data. Prediction is more expensive because the algorithm must search for similar records at query time. The most important ideas for exams are: choosing a good distance measure, choosing $k$, understanding weighted neighbours, and knowing why nearby data should matter more than faraway data.


Model-Based vs Instance-Based #

Model-based methods try to learn a compact predictive model during training. Examples include Linear Regression, Logistic Regression, and Decision Trees. Instance-based learning keeps the training data instead, postponing most computation until prediction time.

| Feature | Instance-Based Learning | Model-Based Learning |
|---|---|---|
| Approach | Memorises examples | Learns general patterns/rules |
| Training Time | Minimal (just stores data) | Can be long (building the model) |
| Prediction Time | Slow (searches all data) | Fast (uses a pre-built formula) |
| Storage | High (must keep all data) | Low (only stores the model) |
| Adaptability | Adapts instantly to new data | Requires retraining to update |
```mermaid
graph LR
    IBL[Instance-Based Learning]
    IBL --> Numeric[Numerical / Distance-Based]
    IBL --> Symbolic[Symbolic / Reasoning-Based]
    IBL --> Density[Density / Distribution-Based]
    Numeric --> KNN[k-Nearest Neighbours]
    Numeric --> LWR[Locally Weighted Regression]
    Numeric --> RBF[Radial Basis Functions]
    Numeric --> LVQ[Learning Vector Quantization]
    Symbolic --> CBR[Case-Based Reasoning]
    Density --> KDE[Kernel Density Estimation]
    Density --> PW[Parzen Windows]
    style IBL fill:#F8D7E8,stroke:#7A6170,stroke-width:3px,color:#2F2F2F
    style Numeric fill:#D8EAFE,stroke:#5E7084,color:#2F2F2F
    style Symbolic fill:#DDF5E4,stroke:#637A68,color:#2F2F2F
    style Density fill:#FFF1C7,stroke:#8A7A54,color:#2F2F2F
    style KNN fill:#E6E0FF,stroke:#6F6690,color:#2F2F2F
    style LWR fill:#D7F3F0,stroke:#5B7D7A,color:#2F2F2F
    style RBF fill:#FFE0CC,stroke:#8B6A58,color:#2F2F2F
    style LVQ fill:#EDE7D9,stroke:#7B7062,color:#2F2F2F
    style CBR fill:#E2F0CB,stroke:#687A55,color:#2F2F2F
    style KDE fill:#FFE5EC,stroke:#8A6570,color:#2F2F2F
    style PW fill:#FDE2E4,stroke:#8A6570,color:#2F2F2F
```
```mermaid
flowchart LR
  A[Training data] --> B{Learning style}
  B --> C[Model-based learning]
  B --> D[Instance-based learning]
  C --> E[Learn parameters or rules during training]
  C --> F[More training time]
  C --> G[Less prediction time]
  D --> H[Store training instances]
  D --> I[Less training time]
  D --> J[More prediction time]
  style A fill:#FFF1C7,stroke:#8A7A54,color:#2F2F2F
  style B fill:#F8D7E8,stroke:#7A6170,color:#2F2F2F
  style C fill:#D8EAFE,stroke:#5E7084,color:#2F2F2F
  style D fill:#DDF5E4,stroke:#637A68,color:#2F2F2F
  style E fill:#EAF4FF,stroke:#6B7D90,color:#2F2F2F
  style F fill:#EAF4FF,stroke:#6B7D90,color:#2F2F2F
  style G fill:#EAF4FF,stroke:#6B7D90,color:#2F2F2F
  style H fill:#EAF8EE,stroke:#637A68,color:#2F2F2F
  style I fill:#EAF8EE,stroke:#637A68,color:#2F2F2F
  style J fill:#EAF8EE,stroke:#637A68,color:#2F2F2F
```

Eager Learners vs Lazy Learners #

Model-based methods are often called eager learners — they spend more effort during training and less during prediction.

Instance-based methods are often called lazy learners — they spend little effort during training and more during prediction.

This distinction is one of the most important conceptual points in this module.


KNN #

“If it walks like a duck and quacks like a duck, it is probably a duck.”

The machine learning version: if a new point looks similar to known points of a certain class, it is likely to belong to that class. The whole method depends on two questions:

  1. How do we define similarity or distance?
  2. How many neighbours should influence the final answer?

General Workflow #

```mermaid
flowchart TD
  A[Store training examples] --> B[Receive query point x_q]
  B --> C[Compute distance or similarity]
  C --> D[Select relevant neighbours]
  D --> E{Task type}
  E --> F[Classification: vote]
  E --> G[Regression: average or weighted average]
  F --> H[Predicted class]
  G --> I[Predicted value]
  style A fill:#DDF5E4,stroke:#637A68,color:#2F2F2F
  style B fill:#FFF1C7,stroke:#8A7A54,color:#2F2F2F
  style C fill:#D8EAFE,stroke:#5E7084,color:#2F2F2F
  style D fill:#E6E0FF,stroke:#6F6690,color:#2F2F2F
  style E fill:#F8D7E8,stroke:#7A6170,color:#2F2F2F
  style F fill:#FFE0CC,stroke:#8B6A58,color:#2F2F2F
  style G fill:#D7F3F0,stroke:#5B7D7A,color:#2F2F2F
  style H fill:#EAF8EE,stroke:#637A68,color:#2F2F2F
  style I fill:#EAF8EE,stroke:#637A68,color:#2F2F2F
```
  1. Store the training set
  2. Receive a query point $x_q$
  3. Compute the distance between $x_q$ and training points
  4. Select one or more nearest neighbours
  5. Use those neighbours to infer the output

Classification and Regression #

KNN can be used for both classification and regression. Given training examples $\langle x_i, f(x_i) \rangle$, for a new query point $x_q$, find the nearest examples and use their outputs to estimate $f^*(x_q)$.

For 1-NN:

\[ f^*(x_q)=f(x_n) \]

where $x_n$ is the nearest training example to $x_q$.

For classification — use majority vote among the $k$ nearest neighbours:

\[ \hat{f}(x_q)=\arg\max_{v \in V}\sum_{i=1}^{k}\delta\!\big(v,f(x_i)\big) \]

where $V$ is the set of possible class labels and $\delta(a,b)=1$ if $a=b$, otherwise $0$.

For regression — use the average (or median) of the $k$ nearest neighbours:

\[ \hat{f}(x_q)=\frac{1}{k}\sum_{i=1}^{k} f(x_i) \]

Median can be more robust when outliers are present.

Training and Prediction #

Training phase: KNN does almost no parameter learning. It stores input features and corresponding outputs, so training is simple and fast.

Prediction phase: For each query point the algorithm must compare against many or all training examples, compute distances, sort to identify nearest points, and then vote or average. This is why KNN is called a lazy learner.

Step-by-Step Procedure #

For classification:

  1. Take the query point $x_q$
  2. Compute distance from $x_q$ to every training point
  3. Sort distances from smallest to largest
  4. Select the first $k$ neighbours
  5. Count the class labels among them
  6. Return the class with the highest count

For regression:

  1. Take the query point $x_q$
  2. Compute distance to all training points
  3. Select the first $k$ neighbours
  4. Average their target values
  5. Return the average as the prediction
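The two procedures above can be sketched in a few lines of Python; the toy training sets are made up for illustration:

```python
import math

def knn_predict(train, query, k, task="classification"):
    """Plain KNN: train is a list of (features, target) pairs."""
    # Step 2-3: distance from the query to every training point, sorted ascending.
    dists = sorted((math.dist(x, query), y) for x, y in train)
    # Step 4: keep the targets of the k nearest neighbours.
    neighbours = [y for _, y in dists[:k]]
    if task == "classification":
        # Step 5-6: majority vote among the k nearest labels.
        return max(set(neighbours), key=neighbours.count)
    # Regression: average of the k nearest target values.
    return sum(neighbours) / k

train_cls = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train_cls, (1.5, 1.5), k=3))  # "A"
```

The same helper handles regression by passing `task="regression"`; only the decision rule at the end changes.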

Choosing $k$ #

The value of $k$ is the key hyperparameter in KNN.

  • If $k$ is too small, the model becomes too sensitive to noise
  • If $k$ is too large, the model becomes too smooth and may ignore local structure

In binary classification, using an odd value of $k$ is recommended to avoid ties in voting.

Rule of thumb:

\[ k=\sqrt{N} \]

where $N$ is the number of training points. This is a starting point only, not a guaranteed best choice.

Elbow method: Try several values of $k$, compute an error function (e.g., SSE), plot error against $k$, and choose the point after which improvement becomes much smaller. A common example identifies $k=3$ as the elbow point.
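A minimal sketch of the elbow idea, using SSE of one-dimensional KNN regression on a hand-made validation set (the data and candidate $k$ values are assumptions, not a prescribed recipe):

```python
def knn_regress(train, query, k):
    """1-D KNN regression: average of the k nearest target values."""
    dists = sorted((abs(x - query), y) for x, y in train)
    return sum(y for _, y in dists[:k]) / k

train = [(x, 2 * x) for x in range(10)]           # y = 2x, noise-free
validation = [(0.5, 1.0), (4.5, 9.0), (8.5, 17.0)]

for k in (1, 2, 3, 5, 7):
    sse = sum((knn_regress(train, x, k) - y) ** 2 for x, y in validation)
    print(k, round(sse, 3))
# For this noise-free toy set the SSE is smallest at k = 2 and grows again
# for larger k; the bend in the plotted curve is the "elbow".
```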

| $k$ size | Behaviour |
|---|---|
| Small | High variance, can overfit, highly local |
| Large | Higher bias, smoother boundary, less sensitive to noise |

Distance and Similarity #

KNN does not learn a separating equation in advance, so prediction quality depends heavily on the distance or similarity function chosen.

A data matrix has $n$ data points each with $p$ attributes. The corresponding distance matrix stores pairwise distances between points — it is symmetric and often shown in triangular form.

Properties of a valid metric for $d(i,j)$:

  • $d(i,j) > 0$ if $i \neq j$ (positive definiteness)
  • $d(i,i)=0$ (identity)
  • $d(i,j)=d(j,i)$ (symmetry)
  • $d(i,j) \le d(i,k)+d(k,j)$ (triangle inequality)

Minkowski distance (general family):

\[ d(i,j)=\left(\sum_{f=1}^{p}|x_{if}-x_{jf}|^h\right)^{1/h} \]

where $h$ is the order of the norm.

Manhattan ($L_1$) — useful when absolute coordinate-wise difference matters:

\[ d(i,j)=\sum_{f=1}^{p}|x_{if}-x_{jf}| \]

Euclidean ($L_2$) — the most common geometric distance:

\[ d(i,j)=\sqrt{\sum_{f=1}^{p}(x_{if}-x_{jf})^2} \]

Supremum ($L_\infty$) — controlled by the largest coordinate difference:

\[ d(i,j)=\max_f |x_{if}-x_{jf}| \]
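The three special cases can be checked with one small helper for the Minkowski family; the example points are arbitrary:

```python
def minkowski(a, b, h):
    """Minkowski distance of order h between two points."""
    if h == float("inf"):                       # supremum / L-infinity case
        return max(abs(p - q) for p, q in zip(a, b))
    return sum(abs(p - q) ** h for p, q in zip(a, b)) ** (1 / h)

x, y = (1, 2, 3), (4, 6, 3)
print(minkowski(x, y, 1))             # Manhattan: 3 + 4 + 0 = 7
print(minkowski(x, y, 2))             # Euclidean: sqrt(9 + 16) = 5.0
print(minkowski(x, y, float("inf")))  # Supremum: max(3, 4, 0) = 4
```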

Similarity for Different Attribute Types #

Nominal attributes (no natural order, e.g., profession, colour, mother tongue):

  • Simple matching: matching values contribute similarity; differing values contribute dissimilarity
  • Binary indicator encoding: one binary feature per possible nominal state (similar to one-hot encoding)

Binary attributes:

  • Symmetric binary: both states are equally important (e.g., gender)
  • Asymmetric binary: one state is more important, common in presence/absence situations (e.g., symptoms, test results)

Jaccard similarity for asymmetric binary variables:

\[ \mathrm{sim}_{\mathrm{Jaccard}}(i,j)=\frac{q}{q+r+s} \]

where $q$ counts attributes that are positive in both objects, $r$ counts those positive only in $i$, and $s$ counts those positive only in $j$; matches on the unimportant (absent) state are ignored. The related dissimilarity is:

\[ d(i,j)=\frac{r+s}{q+r+s} \]
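A sketch of the computation, assuming 1 encodes the important (present) state; the two example vectors are made up:

```python
def jaccard_sim(i, j):
    """Jaccard similarity for two asymmetric binary vectors (0/1 values)."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # shared positives
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # positive in i only
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # positive in j only
    return q / (q + r + s)   # 0-0 matches are ignored entirely

patient_1 = (1, 0, 1, 0, 0, 0)
patient_2 = (1, 0, 1, 0, 1, 0)
print(jaccard_sim(patient_1, patient_2))      # q=2, r=0, s=1 -> 2/3
print(1 - jaccard_sim(patient_1, patient_2))  # dissimilarity (r+s)/(q+r+s) = 1/3
```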

Ordinal attributes (ordered values, e.g., Village → Small Town → Suburban → Metropolitan):

  1. Replace values by ranks
  2. Map ranks to $[0,1]$
  3. Treat them like interval-scaled variables

Mixed data types — Gower distance: Handles a mix of nominal, binary, numeric, and ordinal attributes by computing component-wise dissimilarity per attribute type, normalising appropriately, and combining with equal or chosen weights.

Typical workflow for mixed data:

  1. Normalise numeric attributes
  2. Rank and scale ordinal attributes
  3. Compare nominal attributes by match/mismatch
  4. Combine the partial distances
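The four steps above can be sketched as a Gower-style dissimilarity. The attribute layout, ranges, and equal weighting here are illustrative assumptions; real Gower implementations offer per-attribute weights:

```python
def mixed_dissimilarity(a, b, types, ranges):
    """Gower-style dissimilarity; types[f] is 'numeric', 'ordinal', or 'nominal'.
    Ordinal values are assumed to be ranks already; ranges[f] is the attribute's
    value range (ignored for nominal attributes)."""
    total = 0.0
    for f, (va, vb) in enumerate(zip(a, b)):
        if types[f] == "nominal":
            total += 0.0 if va == vb else 1.0    # match/mismatch
        else:
            # numeric and rank-scaled ordinal: normalised absolute difference
            total += abs(va - vb) / ranges[f]
    return total / len(a)                        # equal weight per attribute

a = (30, 2, "teacher")   # (age, town-size rank 1..4, profession)
b = (40, 4, "nurse")
types = ("numeric", "ordinal", "nominal")
ranges = (50, 3, None)
print(mixed_dissimilarity(a, b, types, ranges))  # (0.2 + 2/3 + 1) / 3
```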

Feature Scaling #

KNN is sensitive to scale. If one numeric feature has a much larger range than another, it dominates the distance computation: the neighbourhood becomes misleading and prediction quality drops.

Z-score standardisation:

\[ z=\frac{x-\mu}{\sigma} \]

where $x$ is the raw value, $\mu$ is the mean, and $\sigma$ is the standard deviation. Min-max normalisation is also used in some exam questions.
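As a quick illustration of why standardisation matters, two features with very different ranges map to identical z-scores (the values are made up):

```python
import statistics

def zscore(values):
    """Z-score standardisation: (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(x - mu) / sigma for x in values]

incomes = [30_000, 50_000, 70_000]      # large-range feature
ages = [25, 35, 45]                     # small-range feature
print(zscore(incomes))                  # [-1.2247..., 0.0, 1.2247...]
print(zscore(ages))                     # same standardised values
```

After standardisation neither feature can dominate the Euclidean distance on range alone.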

Curse of Dimensionality #

If many features are used but only a few are truly relevant, two genuinely similar records may appear far apart because distance is computed across many irrelevant dimensions. Traditional KNN uses all attributes unless you intervene, so irrelevant attributes distort the geometry of the space.

Remedies:

  • Remove less relevant attributes — validate performance after removal
  • Weight attributes differently — give lower weight to less relevant features; conceptually this shortens less relevant axes and lengthens more relevant ones

Challenges of KNN #

| Challenge | Description |
|---|---|
| Choosing $k$ | A hyperparameter; bad choice causes overfitting or over-smoothing |
| Choosing the distance metric | Results differ across Euclidean, Manhattan, mixed-type, and weighted distance |
| Query-time cost | Each prediction may require comparison against all training points |
| Noise and irrelevant features | Can mislead the nearest-neighbour search |

Distance-Weighted KNN #

Plain KNN may fail when a query point lies near a few very close examples of one class, but the majority of slightly farther neighbours belong to another class. A plain majority vote may predict incorrectly.

Distance-weighted KNN fixes this by assigning larger influence to closer points using a kernel weight function:

\[ w_i = K\big(d(x_q,x_i)\big) \]

Example kernels:

\[ K\big(d(x_q,x_i)\big)=\frac{1}{d(x_q,x_i)^2} \]

\[ K\big(d(x_q,x_i)\big)=\frac{1}{(d_0+d(x_q,x_i))^2} \]

\[ K\big(d(x_q,x_i)\big)=\exp\left(-\left(\frac{d(x_q,x_i)}{\sigma_0}\right)^2\right) \]

Weighted classification rule:

\[ \hat{f}(x_q)=\arg\max_{v \in V}\sum_{i=1}^{k} w_i\,\delta\!\big(v,f(x_i)\big) \]

Weighted regression rule:

\[ \hat{f}(x_q)=\frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i} \]

A larger distance means smaller weight; a smaller distance means larger weight. Similarity increases the influence of that training instance on the final prediction.
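A sketch of the weighted vote using the second kernel above (the offset $d_0$ keeps the weight finite when the query coincides with a training point). The toy data are an assumption, chosen so that plain 5-NN would vote "B" while the weighted vote prefers the two much closer "A" points:

```python
import math

def weighted_knn_classify(train, query, k, d0=0.1):
    """Distance-weighted KNN with the kernel 1 / (d0 + d)^2."""
    dists = sorted((math.dist(x, query), y) for x, y in train)
    votes = {}
    for d, label in dists[:k]:
        w = 1.0 / (d0 + d) ** 2          # closer points get larger weight
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# Two very close "A" points vs three slightly farther "B" points:
train = [((1.0, 0.0), "A"), ((0.0, 1.0), "A"),
         ((3.0, 0.0), "B"), ((0.0, 3.0), "B"), ((3.0, 3.0), "B")]
print(weighted_knn_classify(train, (0.0, 0.0), k=5))  # "A"
```

With $k=5$ a plain majority vote over the same neighbours returns "B" (3 labels to 2); the distance weights reverse the decision.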


Locally Weighted Regression #

LWR is the regression counterpart of the “nearby points should matter more” idea, and is one of the key syllabus items in this module.

Intuition #

  • Global linear regression fits one line for all data points
  • Locally weighted regression fits a small weighted local model around the query point

If the query changes, its local fitted line may also change.

| Ordinary Linear Regression | Locally Weighted Regression |
|---|---|
| One global hypothesis | Local model per query point |
| Shared slope and intercept | Coefficients depend on the query |
| Same line used for all predictions | Different local line for different queries |

Workflow #

  1. Take query point $x_q$
  2. Compute distance from every training point to $x_q$
  3. Convert distances to weights using a kernel
  4. Fit a weighted local regression model
  5. Evaluate the fitted model at $x_q$
  6. Return the local prediction

LWR as Weighted Least Squares #

\[ J_q(w)=\frac{1}{2}\sum_{i=1}^{n} w_i(x_q)\big(h_w(x_i)-y_i\big)^2 \]

where $w_i(x_q)$ is the weight of training point $x_i$ relative to $x_q$, and $h_w(x_i)$ is the local regression prediction. The final query prediction is:

\[ \hat{y}_q = h_{w^{(q)}}(x_q) \]

The effective parameters $w^{(q)}$ depend on the query point, so the fitted local model changes from query to query.
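A one-dimensional sketch of this idea, using a Gaussian kernel and the closed-form weighted least-squares fit of a line $a + bx$; the bandwidth `sigma0` and the toy data are assumptions:

```python
import math

def lwr_predict(xs, ys, x_q, sigma0=1.0):
    """Fit a weighted line a + b*x around x_q and evaluate it there."""
    # Kernel weights: nearby training points matter more.
    w = [math.exp(-((x - x_q) / sigma0) ** 2) for x in xs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw     # weighted means
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    # Closed-form weighted least squares for slope and intercept.
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    b = sxy / sxx
    a = ybar - b * xbar
    return a + b * x_q

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 2, 5, 8, 11]   # slope changes from 1 to 3 midway
print(lwr_predict(xs, ys, 0.5, sigma0=0.7))   # tracks the shallow left part
print(lwr_predict(xs, ys, 4.5, sigma0=0.7))   # tracks the steep right part
```

Because the weights are recomputed per query, the two calls fit different local lines, which is exactly why $w^{(q)}$ carries the query index.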

Two Design Viewpoints #

View 1 — Local approximation: Use only nearby neighbours, then fit a local regression on that neighbourhood.

View 2 — Weighted global approximation: Use all data, but multiply each error term by a weight that decreases with distance from the query. Far points contribute very little, giving the benefit of the full dataset while still focusing on the neighbourhood.

Kernel Choice and Bandwidth #

The kernel is not one formula for all datasets — experimentation is needed. Possible choices include inverse distance, inverse squared distance, exponential/Gaussian weighting, and quadratic kernel variants.

Bandwidth guides locality:

  • Small bandwidth — very local behaviour, strong emphasis on nearby points
  • Large bandwidth — smoother behaviour, broader neighbourhood influence

When LWR Is Useful #

  • One single global linear trend is too crude
  • The relationship varies across the input space
  • Nearby points are more informative than distant ones

Radial Basis Functions (RBF) #

An RBF is a function whose value depends mainly on the distance from a centre. A common Gaussian RBF:

\[ \phi_j(x)=\exp\left(-\frac{\|x-c_j\|^2}{2\sigma^2}\right) \]

where $c_j$ is the centre and $\sigma$ controls the spread.

  • If $x$ is close to $c_j$, then $\phi_j(x)$ is large
  • If $x$ is far from $c_j$, then $\phi_j(x)$ becomes small

Model Form #

\[ \hat{y}(x)=w_0+\sum_{j=1}^{M} w_j\,\phi_j(x) \]

This is linear in the weights $w_j$ but non-linear in the original input $x$.
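A minimal sketch of this model form in one dimension; the centres, spread, and weights are illustrative values, not fitted ones:

```python
import math

def rbf_features(x, centres, sigma):
    """Gaussian RBF activation phi_j(x) for each centre c_j."""
    return [math.exp(-abs(x - c) ** 2 / (2 * sigma ** 2)) for c in centres]

def rbf_predict(x, centres, sigma, w0, weights):
    """y(x) = w0 + sum_j w_j * phi_j(x): linear in the weights w_j."""
    phis = rbf_features(x, centres, sigma)
    return w0 + sum(w * p for w, p in zip(weights, phis))

centres = [0.0, 5.0]
y = rbf_predict(0.0, centres, sigma=1.0, w0=0.0, weights=[2.0, 3.0])
print(y)
# Near centre 0: phi_0 = 1 while phi_1 = exp(-12.5) is tiny, so y is
# almost exactly the first weight, 2.0 — the prediction is dominated
# by the nearest centre.
```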

Connection to the Module Theme #

RBF shares the same local-neighbourhood philosophy as KNN and LWR:

  • KNN — use nearby examples more
  • LWR — fit a local weighted model around the query
  • RBF — build features that are naturally strongest near selected centres and weaker farther away

Comparison of KNN, LWR, and RBF #

```mermaid
flowchart TD
  A[One global regression line] --> B[Ordinary linear regression]
  C[Neighbour vote or average] --> D[KNN]
  E[Weighted neighbours] --> F[Distance-weighted KNN]
  G[Local fitted regression line] --> H[Locally weighted regression]
  I[Local basis centred at prototypes] --> J[Radial Basis Functions]
  style A fill:#D8EAFE,stroke:#5E7084,color:#2F2F2F
  style B fill:#EAF4FF,stroke:#6B7D90,color:#2F2F2F
  style C fill:#DDF5E4,stroke:#637A68,color:#2F2F2F
  style D fill:#EAF8EE,stroke:#637A68,color:#2F2F2F
  style E fill:#FFF1C7,stroke:#8A7A54,color:#2F2F2F
  style F fill:#FFF7DC,stroke:#8A7A54,color:#2F2F2F
  style G fill:#F8D7E8,stroke:#7A6170,color:#2F2F2F
  style H fill:#FDEAF3,stroke:#7A6170,color:#2F2F2F
  style I fill:#FFE0CC,stroke:#8B6A58,color:#2F2F2F
  style J fill:#FFF0E6,stroke:#8B6A58,color:#2F2F2F
```

Methods at a Glance #

| Method | Main idea | Output style | What changes with query? |
|---|---|---|---|
| KNN | Use nearest stored neighbours | Vote / average | Chosen neighbours |
| Weighted KNN | Use neighbours, weight close ones more | Weighted vote / weighted average | Weights and influence |
| LWR | Fit a local weighted regression model around query | Local regression prediction | Local fitted coefficients |
| RBF | Use local basis functions centred at prototypes | Weighted basis-function output | Basis activations |

Weighted KNN vs LWR #

| | Distance-weighted KNN | LWR |
|---|---|---|
| Local regression fit | No | Yes |
| Uses neighbour outputs directly | Yes | No |
| Decision rule | Weighted vote or weighted average | Value from local fitted model |

LWR is a more structured local regression version of the “nearby points matter more” principle.

Advantages and Limitations #

Advantages:

  • Conceptually simple — similar inputs should give similar outputs
  • Flexible — no single global model shape is forced on the data
  • Works for both classification and regression
  • Naturally local — useful when local neighbourhood structure is more meaningful than one global formula
  • Little or no training cost — training often means only storing the examples

Limitations:

  • Expensive prediction time — distance must be computed at query time
  • Sensitive to scaling — unscaled features can dominate distance
  • Sensitive to irrelevant features — leads to misleading neighbourhoods
  • Sensitive to choice of $k$ — poor hyperparameter choice hurts performance
  • Suffers in high dimensions — curse of dimensionality is a major issue
  • Memory-heavy — training examples must be stored

Study Notes #

Key Facts to Remember #

  • KNN is a lazy learner; model-based learning is an eager learner
  • KNN solves both classification (majority vote) and regression (mean or median)
  • $k$ is a hyperparameter; odd $k$ helps avoid ties in binary classification
  • Distance metric choice matters; feature scaling matters
  • High dimensionality can mislead nearest-neighbour methods
  • Weighted KNN gives more influence to closer neighbours
  • LWR builds a local model around the query
  • RBF captures local influence through distance from centres

Numerical Template: KNN #

  1. Write the query point
  2. Compute distance from the query to every training point
  3. Sort distances in ascending order
  4. Take the first $k$ rows
  5. Apply the decision rule:
    • Classification — count labels and apply majority vote
    • Regression — average the target values of the selected neighbours
  6. State the conclusion: “Hence the predicted class is …” or “Hence the predicted value is …”

Numerical Template: Weighted KNN #

  1. Prepare the data — normalise numeric variables, rank ordinal variables, handle nominal attributes using match/mismatch distance
  2. Compute the distance from query $x_q$ to each record
  3. Apply the kernel function to each distance to compute the weight
  4. Sum total weight class-wise
  5. Choose the class with the largest total weight
  6. Explain how plain KNN may prefer the majority class while weighted KNN may prefer the more similar class

Numerical Template: LWR #

  1. Identify the query point $x_q$
  2. Compute distance from each training point to $x_q$
  3. Apply the chosen kernel to obtain weights
  4. Form the weighted local loss
  5. Fit the local coefficients or interpret the local fitted curve
  6. Use the resulting local model to compute $\hat{y}_q$
  7. Explain why the answer differs from a global regression line

Common Exam Question Patterns #

  1. KNN Classification — compute Euclidean distance, sort neighbours, apply majority vote; compare answers for $k=3$, $k=5$, $k=7$
  2. KNN Regression — compute nearest neighbours and average their target values
  3. Best value of $k$ — use odd $k$ for binary classification; identify elbow point from error plot; explain effect of small vs large $k$
  4. Mixed-data dissimilarity — normalise numeric features, rank ordinal features, compare nominal features, combine partial distances
  5. Weighted KNN — compute distances, compute kernel weights, show why weighted result may differ from unweighted majority vote
  6. LWR conceptual — explain why LWR produces different local lines for different query points; compare global regression with LWR
  7. Curse of dimensionality — define it, explain why KNN is misled in high dimensions, suggest remedies

Summary #

Instance-based learning is built on one central belief: a useful prediction can often be made by looking at similar previously observed examples.

KNN is the simplest and most direct version of this idea. Distance-weighted KNN improves it by giving stronger influence to closer neighbours. Locally weighted regression extends the same idea to local regression fitting. Radial basis functions provide another local-distance-based way to construct non-linear predictors.

Although these methods look different, they are all connected by the same local learning principle:

Nearby points should matter more than faraway points.

