Decision Tree #
A decision tree classifies an example by asking a sequence of questions about its attributes until it reaches a leaf (final decision).
Key takeaway: A decision tree grows by repeatedly splitting the training data into purer subsets using an impurity measure (Entropy / Gini / Classification Error).
Information Theory #
Decision trees need a way to measure: “How mixed are the class labels at a node?”
Impurity (how mixed are the labels?) #
A node is:
- pure if all records belong to the same class → impurity = 0
- impure if records are mixed across classes
Entropy (base-2) #
Entropy measures impurity/uncertainty in the class distribution at a node.
\[ H(S)=-\sum_{c=1}^{C} p(c)\log_2 p(c) \]Binary case ($p_+$ positives, $p_-$ negatives):
\[ H(S)= -p_{+}\log_2(p_{+}) - p_{-}\log_2(p_{-}) \]Interpretation:
- $H(S)=0$ → pure node
- $H(S)$ is maximum when classes are evenly distributed
Information Gain (entropy reduction) #
Information gain is the expected reduction in entropy after splitting on attribute $A$.
\[ \mathrm{Gain}(S,A)=H(S)-\sum_{v\in \mathrm{Values}(A)}\frac{|S_v|}{|S|}H(S_v) \]Where:
- $S$ is the set of examples at the node
- $S_v$ is the subset where attribute $A$ takes value $v$
Decision rule:
- choose the attribute with maximum information gain
- equivalently: pick the attribute with minimum weighted child entropy
Numerical template (entropy + information gain) #
When you see question like: “Compute entropy / compute IG / choose the root node”:
- Count class distribution at the parent node (e.g., Yes/No).
- Compute parent entropy $H(S)$ using log base 2.
- For each candidate attribute $A$:
- split the dataset into subsets $S_v$
- compute entropy of each subset $H(S_v)$
- compute weighted child entropy: $\sum \frac{|S_v|}{|S|}H(S_v)$
- compute gain: $H(S)$ minus weighted child entropy
- Pick the attribute with highest gain (unless the question asks otherwise).
Exam note (important wording):
- If the question says “compute entropy only”, pick the split with minimum entropy.
- If the question says “information gain”, pick the split with maximum gain.
Alternative impurity measures #
Gini Index (used in CART) #
\[ \mathrm{Gini}(S)=1-\sum_{c=1}^{C}p(c)^2 \]Lower Gini → purer node.
Classification Error #
\[ \mathrm{Err}(S)=1-\max_c p(c) \]Note: classification error is less sensitive than entropy/Gini, so it is often not used for growing the tree (more for pruning/evaluation).
Gain Ratio (fixes information gain bias) #
Information gain can unfairly prefer attributes with many distinct values (e.g., CustomerID). Gain ratio penalises such splits.
Split information:
\[ \mathrm{SplitInfo}(S,A)= -\sum_{v\in \mathrm{Values}(A)}\frac{|S_v|}{|S|}\log_2\left(\frac{|S_v|}{|S|}\right) \]Gain ratio:
\[ \mathrm{GainRatio}(S,A)=\frac{\mathrm{Gain}(S,A)}{\mathrm{SplitInfo}(S,A)} \]Entropy Based Decision Tree Construction #
A decision tree is grown recursively by splitting the data into purer subsets.
Tree structure #
- Root node: no incoming edges
- Internal node: test condition on one attribute (two or more outgoing edges)
- Leaf node: class label (no outgoing edges)
Hunt’s Algorithm (high-level) #
At each node:
- If all records belong to the same class → make it a leaf.
- Otherwise:
- choose the best attribute test condition (Entropy/IG or Gini)
- split records into children
- repeat recursively for each child
- Stop when:
- all records are learned, or
- no more useful splits exist, or
- constraints (hyperparameters) say “stop”.
ID3 summary (entropy + information gain) #
ID3 builds the tree using information gain.
Stopping / base cases:
- if all examples at the node are the same class → leaf
- if no attributes remain → leaf with most common class (majority vote)
- if a split produces an empty subset → leaf with most common class of the parent node
Exam note: “min entropy” vs “max gain” #
Sometimes a question may explicitly ask you to:
- compute entropy only and pick the split with minimum entropy, or
- compute information gain and pick the split with maximum gain
Always follow the wording of the question.
Avoiding Overfitting #
Underfitting vs overfitting #
- Underfitting: model too simple → high training error and high test error
- Overfitting: model too complex → low training error but high test error
In decision trees:
- very deep trees can memorise noise/outliers
- the tree may look “perfect” on training data but perform poorly on unseen data
How to reduce overfitting #
Common approaches:
- pruning
- constrain the depth of the tree
- constrain the minimum samples allowed in a leaf (minimum records per node)
Pre-pruning (early stopping) #
Stop tree growth early using stopping conditions, such as:
- stop if Gain (or Gini gain) is below a threshold
- stop if node has fewer than a minimum number of samples
- stop if depth exceeds a maximum level
- stop if improvement in generalisation is not significant
Trade-off:
- too strict → shallow tree → underfitting
- too loose → deep tree → overfitting
Post-pruning (grow then prune) #
Idea:
- grow the tree fully (or until minimal leaf size)
- scan the tree bottom-up
- at each decision node:
- evaluate the subtree on a prune/validation set
- remove the node (replace by a leaf using majority vote) and re-evaluate
- if error reduces, prune; otherwise keep
- repeat on other branches
Error rate = % misclassified tuples on the prune/validation set.
Minimum Description Length #
MDL is an information-theoretic way to trade off:
- tree complexity
- training errors
Interesting fact: the optimal (shortest expected) coding length for an event with probability $p$ is $-\log_2(p)$ bits.
MDL prefers the hypothesis $h$ that minimises:
\[ LC(h)+LC(D\mid h) \]Where:
- $LC(h)$ = number of bits to describe the tree (structure, attributes, thresholds)
- $LC(D\mid h)$ = number of bits to describe the mistakes/exceptions made by the tree
Intuition (Occam’s razor): prefer shorter trees, as long as they still classify reasonably well.
Handling Continuous valued attributes, missing attributes #
Attribute test conditions (categorical vs continuous) #
- Binary attribute: two-way split
- Nominal attribute: multi-way split or binary split by grouping categories
- Continuous attribute: choose a threshold and split
Continuous-valued attributes (choose threshold) #
Given a continuous attribute $A$, create a binary attribute $A_c$:
\[ A_c = \begin{cases} \text{True} & A < c \\\\ \text{False} & A \ge c \end{cases} \]How to find $c$:
Way 1: Multi-way split (PlayTennis-style) #
- sort examples by the attribute value
- identify candidate thresholds at points where the class label changes
- candidate threshold = average of the two consecutive values
Way 2: Binary splits (scan all candidates) #
- sort values
- take midpoints between adjacent sorted values as candidates
- evaluate impurity for each candidate split
- pick the threshold that gives the best impurity reduction
Missing attribute values (practical handling) #
Situations you’ll see in problems:
- After splitting on an attribute, a branch may receive no samples.
- Make that branch a leaf with the majority class from the parent node.
- If no attributes remain for further splitting:
- stop and apply majority voting to label the leaf.
In prediction:
- if a required attribute value is missing for a test record, common strategies include using the majority class at that node (or following the most frequent branch).