Entropy and Information Gain

Information theory is the mathematical study of the quantification, storage and communication of information. A key measure in information theory is entropy.

Entropy in information theory describes how unpredictable a probability distribution is. Equivalently, entropy can be interpreted as the average amount of information each example* contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

(* An example is one row of feature values and, possibly, a label.)

Introduction to Information Theory

A Mathematical Theory of Communication is a foundational paper written by Claude Shannon in 1948, which established the field of information theory. Here’s a brief summary of its key concepts:

1. Information and Entropy:

  • Shannon introduces the concept of information as a measurable quantity. He defines the entropy of a message source as a measure of the uncertainty or randomness in the information being transmitted.
  • Higher entropy means greater unpredictability, and lower entropy means more predictability. Entropy can be used to quantify the amount of information produced by a source.

2. Communication System:

  • The paper outlines a general communication system model with components: source, transmitter, channel, receiver, and destination.
  • The source produces messages, the transmitter encodes them, the channel is the medium through which messages travel (which can introduce noise or distortion), the receiver decodes the message, and the destination is the end-user.

3. Channel Capacity:

  • Shannon defines channel capacity as the maximum rate at which information can be transmitted over a communication channel with an arbitrarily small probability of error. This is a key concept in determining the efficiency of a communication system.
  • He uses mathematical formulas to calculate the capacity of various types of channels, including noisy channels.

4. Noisy Channels and Error Correction:

  • Shannon discusses how noise affects communication, introducing the noisy-channel coding theorem, which states that reliable communication is possible over a noisy channel as long as the transmission rate does not exceed the channel’s capacity.
  • He suggests the use of error-correcting codes to compensate for noise and improve the reliability of communication.

5. Shannon’s Theorem:

  • One of the major results of the paper is Shannon’s theorem, which mathematically proves that reliable (effectively error-free) communication is possible at any rate below a theoretical limit, the channel capacity, and that appropriate coding techniques can approach this limit.

6. Redundancy and Coding:

  • Shannon shows that redundancy in communication (for example, repeating parts of the message) can help counteract noise and ensure reliable transmission.
  • He explains how coding helps to encode messages in ways that reduce errors during transmission, ensuring that the original message can be accurately reconstructed.

Entropy in Data Science

In data science, particularly in decision tree algorithms (like ID3, C4.5, and CART), Shannon’s concept of entropy plays a crucial role in quantifying uncertainty or impurity in data. Here’s how entropy, as introduced by Claude Shannon in A Mathematical Theory of Communication, relates to decision trees in machine learning:

1. Shannon’s Entropy in Communication:

In Shannon’s information theory, entropy is a measure of the uncertainty or unpredictability of a source of information. If the source has multiple possible outcomes (for example, messages or symbols), the entropy quantifies how much uncertainty there is in predicting the next symbol. The formula for entropy is:

[math]H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)[/math]

where:

  • p(x_i) is the probability of each possible outcome x_i,
  • The sum is taken over all possible outcomes.

Entropy reaches its maximum when all outcomes are equally likely (for n equally likely outcomes the maximum is log2(n), which is 1 bit in the two-outcome case), and it is 0 when there is no uncertainty (i.e., the outcome is always the same).
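
As a quick illustration of the formula, here is a minimal Python sketch (the function name entropy and the example distributions are illustrative choices, not from Shannon’s paper):

```python
import math

def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Terms with p = 0 contribute nothing (0 · log2(0) is taken as 0 by convention).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))     # 1.0   — a fair coin: maximum entropy for two outcomes
print(entropy([0.95, 0.05]))   # ~0.29 — a biased coin is far more predictable
print(entropy([0.25] * 4))     # 2.0   — four equally likely outcomes: log2(4) bits
```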

2. Entropy in Decision Trees:

In decision tree algorithms, entropy is used to measure the impurity or uncertainty of a dataset at each node of the tree. The goal of building a decision tree is to split the data at each node such that the resulting child nodes are as pure (certain) as possible, i.e., they contain examples of predominantly one class.

Here’s how entropy works in decision trees:

  • Before a split (in a parent node), if the data is perfectly mixed (i.e., there is a 50-50 split between classes), the entropy is high because there’s a lot of uncertainty in predicting the class of a new data point.
  • After a split (in child nodes), if the data in the child nodes is more homogenous (i.e., mostly one class), the entropy is low because there’s less uncertainty about the class of a new data point.

3. Information Gain:

Decision tree algorithms, such as ID3, use information gain, which is based on entropy, to decide the best feature to split the data on at each node. Information gain is a measure of how much uncertainty (entropy) is reduced after a split:

[math]\text{Information Gain} = H(\text{parent}) - \sum_{i} \frac{|D_i|}{|D|} H(D_i)[/math]

where:

  • H(parent) is the entropy of the parent node,
  • D is the dataset at the parent node,
  • D_i is the subset of the dataset resulting from a particular split,
  • H(D_i) is the entropy of that subset.

The split with the highest information gain is chosen, meaning the feature that best reduces uncertainty is selected for the decision node.
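
To make the formula concrete, here is a small Python sketch of information gain computed from class labels (the function names and the toy split are illustrative, not taken from a particular library):

```python
import math

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_labels, child_label_subsets):
    """Entropy of the parent minus the weighted average entropy of the children."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in child_label_subsets)
    return entropy(parent_labels) - weighted

parent = ["yes"] * 5 + ["no"] * 5                      # a perfectly mixed parent node
children = [["yes"] * 4 + ["no"],                      # left child after a hypothetical split
            ["yes"] + ["no"] * 4]                      # right child
print(round(information_gain(parent, children), 3))    # 0.278
```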

4. Relation to Decision Tree Growth:

  • The decision tree construction process continues by recursively selecting the feature that provides the greatest reduction in entropy (i.e., the highest information gain) until either:
    • the data in the node is perfectly pure (entropy = 0), or
    • a stopping criterion is met (such as a maximum tree depth or a minimum number of samples in a node).

This means that decision trees aim to create splits that result in child nodes with lower entropy, leading to more predictable outcomes (classes).

5. Interpretation:

  • High entropy in a decision tree node indicates that the data at that point is mixed or uncertain, requiring a further split to reduce the uncertainty.
  • Low entropy suggests that the data in the node is mostly of one class, meaning the tree can stop splitting or make a prediction for that node.

In summary, entropy helps the decision tree algorithm quantify uncertainty about the class labels in the data and guides the tree-building process to reduce this uncertainty at each step, ultimately leading to a decision tree that classifies data effectively.

Sample Calculations

Now that we’ve seen how concepts from information theory have been adapted and applied in machine learning and data science, it will be helpful to work through some sample calculations.

Sample 1

Entropy

We can also write the entropy formula in another way.

The entropy of a set with two possible values “0” and “1” (for example, the labels in a binary classification problem) is:

H = -p log p - q log q = -p log p - (1 - p) log(1 - p)

where:

  • H is the entropy.
  • p is the fraction of “1” examples.
  • q is the fraction of “0” examples. Note that q = (1 – p)
  • log is generally log2. In this case, the entropy unit is a bit.

For example, suppose the following:

  • 600 examples contain the value “1”
  • 400 examples contain the value “0”

Therefore, the entropy value is:

  • p = 0.6
  • q = 0.4
  • H = (-0.6) · log2(0.6) – (0.4) · log2(0.4) ≈ 0.971 bits per example

A set that is perfectly balanced (for example, 200 “0”s and 200 “1”s) would have an entropy of 1.0 bit per example. 

H = -[(0.5) · log2(0.5) + (0.5) · log2(0.5)] = -2 · (0.5 · log2(0.5)) = -2 · (0.5 · (-1)) = 1

As a set becomes more imbalanced, its entropy moves towards 0.0.

H = -[(1) · log2(1) + (0) · log2(0)] = -(1 · log2(1)) = 0 (taking 0 · log2(0) = 0 by convention)
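
A minimal Python check of these three cases (the helper name binary_entropy is mine):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a two-class set where p is the fraction of positive examples."""
    if p in (0.0, 1.0):
        return 0.0                      # a pure set has no uncertainty
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

print(round(binary_entropy(0.6), 3))    # 0.971 — the 600/400 split above
print(binary_entropy(0.5))              # 1.0   — perfectly balanced
print(binary_entropy(1.0))              # 0.0   — completely pure
```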

Information Gain

Information gain measures the expected reduction in entropy: entropy measures the impurity in the data, and information gain measures how much a split reduces that impurity. The feature whose split leaves the minimum impurity (i.e., gives the maximum information gain) is chosen as the root node.

The information gain of a split is calculated as the entropy of the parent node minus the weighted average of the entropies of the child nodes.

Let’s continue with the dataset above, where we had 600 observations belonging to class YES and 400 observations belonging to class NO.

Assume these outcomes are grouped by a feature called ‘Color’ with the following values:

  • Blue color has 300 Y and 300 N outcomes
  • Green color has 300 Y and 100 N outcomes

We’ve already calculated the entropy above.

H(Outcome) ≈ 0.971

Now, let’s calculate the entropy for Outcome for each value of Color and add them using a weighted average of the proportion of observations.

H(O_Blue) = -[(300/600) · log2(300/600) + (300/600) · log2(300/600)] = 1

H(O_Green) = -[(300/400) · log2(300/400) + (100/400) · log2(100/400)] ≈ 0.811

Weighted average = 600/1000 · H(O_Blue) + 400/1000 · H(O_Green)
= 0.6 · 1 + 0.4 · 0.811
≈ 0.924

Information Gain(O, Color) = H(O) - Weighted average = 0.971 - 0.924 ≈ 0.047
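
The same numbers can be reproduced in a few lines of Python (the helper entropy_from_counts is an illustrative name; note that the unrounded gain is about 0.046, and the 0.047 above comes from rounding the intermediate entropies):

```python
import math

def entropy_from_counts(*counts):
    """Entropy, in bits, given the class counts in a node."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

h_parent = entropy_from_counts(600, 400)                # ≈ 0.971
h_blue = entropy_from_counts(300, 300)                  # = 1.0
h_green = entropy_from_counts(300, 100)                 # ≈ 0.811
weighted = 600 / 1000 * h_blue + 400 / 1000 * h_green   # ≈ 0.9245
print(round(h_parent - weighted, 3))                    # 0.046 — matches the 0.047 above up to rounding
```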

Sample 2

For a dataset having many features, the information gain of each feature is calculated. The feature with the maximum information gain will be the most important feature and will be the root node for the decision tree.

To demonstrate that, let’s look at a different sample.

Imagine we have a population of 300 instances: 160 of them own a cat and the other 140 don’t. We also have two features with the following values:

  • Bank Account: < 20k, ≥ 20k
  • Marital Status: Single, Married, Other

Let’s assume the features’ relationship with owning a cat is as follows.

Bank Account      Have cat :)    No cat :(
< 20k             120            10
≥ 20k             40             130

Marital Status    Have cat :)    No cat :(
Single            70             10
Married           40             60
Other             50             70

Let’s demonstrate how a decision tree algorithm would decide on what attribute to split first by checking what features provide more information (or reduce more uncertainty about the target variable).

Entropy

First we need to calculate the entropy of the parent node, i.e., the entropy of the target variable (having a cat).

E(Cat) = – (160/300) · log2(160/300) – (140/300) · log2(140/300) ≈ 0.99

Let’s start with ‘Bank Account’ and see how much uncertainty the tree can reduce by splitting on ‘Bank Account’.

Information Gain: Feature 1 – Bank Account

E(Account <20k) = – (120/130) · log2(120/130) – (10/130) · log2(10/130) ≈ 0.39

E(Account ≥20k) = – (40/170) · log2(40/170) – (130/170) · log2(130/170) ≈ 0.79

Weighted Average of entropy for each node:

E(Account) = 130/300 · 0.39 + 170/300 · 0.79 = 0.62

Information Gain:

IG(Cat, Bank Account) = E(Cat) – E(Account) = 0.99 – 0.62 = 0.37

Splitting on feature ‘Account’ leads to an information gain of 0.37 on the target variable. Let’s repeat the process for the feature ‘Status’ and compare the results.

Information Gain: Feature 2 – Marital Status

E(Status Single) = – (70/80) · log2(70/80) – (10/80) · log2(10/80) ≈ 0.54

E(Status Married) = – (40/100) · log2(40/100) – (60/100) · log2(60/100) ≈ 0.97

E(Status Other) = – (50/120) · log2(50/120) – (70/120) · log2(70/120) ≈ 0.98

Weighted Average of entropy for each node:

E(Status) = 80/300 · 0.54 + 100/300 · 0.97 +  120/300 · 0.98 = 0.86

Information Gain:

IG(Cat, Marital Status) = E(Cat) – E(Status) = 0.99 – 0.86 = 0.13

First Split

The information gain from the feature ‘Account’ is far greater than the information gain from the feature ‘Status’; splitting on ‘Account’ reduces more disorder in the target variable. A decision tree algorithm would use this result to make the first split on the data using ‘Account’.

From here on, the decision tree algorithm would use this process at every split to decide what feature it is going to split on next.
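
Here is a short Python check of this first-split comparison, using the counts from the tables above (the helper names are illustrative; the unrounded gains are about 0.38 and 0.14, while the 0.37 and 0.13 above come from rounding the intermediate entropies):

```python
import math

def entropy_from_counts(*counts):
    """Entropy, in bits, given the class counts (have cat, no cat) in a node."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy_from_counts(*child)
                   for child in child_counts_list)
    return entropy_from_counts(*parent_counts) - weighted

parent = (160, 140)                                     # (have cat, no cat)
splits = {
    "Bank Account": [(120, 10), (40, 130)],             # < 20k, ≥ 20k
    "Marital Status": [(70, 10), (40, 60), (50, 70)],   # Single, Married, Other
}
gains = {name: information_gain(parent, children) for name, children in splits.items()}
print(gains)                       # Bank Account ≈ 0.38, Marital Status ≈ 0.14
print(max(gains, key=gains.get))   # 'Bank Account' wins the first split
```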

Further Steps

The decision tree makes local decisions at each step: after each split, these calculations are repeated on each resulting node to decide the further splits. So the splitting order adapts dynamically as the tree grows, based on the current data subsets.
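
To make this recursive process concrete, here is a minimal ID3-style sketch in Python for categorical features (the function names, the toy dataset, and the simplified stopping criteria are my own; this is an illustration, not a production implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def best_feature(rows, labels, features):
    """Pick the feature with the highest information gain on this subset."""
    parent_entropy = entropy(labels)
    def gain(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return parent_entropy - weighted
    return max(features, key=gain)

def build_tree(rows, labels, features):
    """Recursively split on the locally best feature, ID3-style."""
    if len(set(labels)) == 1 or not features:          # pure node, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]    # leaf: predict the majority class
    feature = best_feature(rows, labels, features)
    remaining = [f for f in features if f != feature]
    tree = {feature: {}}
    for value in set(row[feature] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[feature] == value]
        sub_rows, sub_labels = zip(*subset)
        # The same entropy/information-gain calculation is repeated on every child subset.
        tree[feature][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return tree

# A tiny illustrative dataset (not the cat example above).
rows = [
    {"Account": "<20k", "Status": "Single"},
    {"Account": "<20k", "Status": "Married"},
    {"Account": ">=20k", "Status": "Single"},
    {"Account": ">=20k", "Status": "Other"},
]
labels = ["cat", "cat", "no cat", "no cat"]
print(build_tree(rows, labels, ["Account", "Status"]))
# {'Account': {'<20k': 'cat', '>=20k': 'no cat'}}
```

Each recursive call sees only the subset of rows routed to that node, which is why the splitting order can differ from branch to branch.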


Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary sources were A Mathematical Theory of Communication by Claude Shannon and Google Machine Learning Education.