Machine Learning: Entropy and Classification

A Simple Classification Example

Let’s say we have a dataset with categorical features , , , and a binary target variable :

id Feature Feature Feature Target Variable
1 a c e
2 b d e
3 b d f
4 a d e
5 a c f
6 b d f

The goal is to find the feature that best predicts the value of .

After a bit of close inspection, we see that Feature will have the highest “predictive power” in determining :

The other features and are not as accurate as in predicting the target. We call Feature as the most informative attribute.


It might seem an easy task to determine the most informative attribute in the dataset above by hand. But for larger, more complex datasets, it’s impractical to count and match all the features with target classes for all samples. We need a way to measure the informativeness of a feature, that is, how it effectively classifies the dataset. Fortunately, we have entropy to help us do that.

Entropy is a term used in statistical physics as a measure of how disordered a system is. Machine learning’s use of entropy isn’t far from this concept of disorderedness. In ML, a set of instances is said to be disordered when there’s a considerable mix of target classes that the instances belong to. This means that the more mixed the segment is with respect to the classifications (i.e., target variables), the higher the entropy.

The entropy of a set with a binary target with values and is defined as

where is the proportion of the ith target variable in the set. Entropy always ranges between 0 (minimum disorder) and 1 (maximum disorder).

As an exercise, let’s calculate for the whole dataset:

This tells us that a set whose samples are equally divided among the target’s values is maximally disordered.

If we have two or more target classes, the equation above generalizes as follows:

where is the set of unique values of the target variable.

It’s easy to see that the number of elements in is equal to the number of terms in the summation.

Classifying datasets with categorical attributes is all about segmenting data in a way that the resulting subsets have lower entropy than the whole dataset, all without penalizing the model’s performance when it’s time to classify unknown data (i.e., prevent the model from overfitting).

Information Gain

The informativeness of a feature can be measured using the concept of information gain, which can be defined using entropy. Consider a dataset which we call the parent, and we’d like to know how well a certain feature classifies the elements contained in the parent.

We define the information gain of a feature on some parent dataset as

where is the set of unique values of feature , is the entropy of the parent set, is the proportion of instances belonging to the th value of the considered feature, and is the entropy of the set whose elements have the th value of the feature.

Now let’s apply the definition of information gain for every feature in our dataset above.

Information Gain

The information gain for Feature , whose values are the set , is

and in the summation are the proportion of instances where Feature has values and , respectively.

Now and are the entropies of the subsets whose instances have Feature equal to and , respectively.

Thus, the information gain for Feature is

Information Gain

Given the following values,

we see that the value of is zero.

This means that there is no information gained when using Feature to classify the dataset.

Information Gain

Given the following values,

we see that the value of is 1, the highest information gain among all the features.

Using the definitions and tools outlined above, we can now formally state why Feature is the most informative attribute. It’s because it has the highest information gain among all the features in the dataset.

What about numeric features?

We may begin to wonder about regression problems, where the target variable is a numeric one instead of a categorical one. How do we calculate for the most informative attribute in this case?

Information gain still makes sense here. But it must represent a numerical feature’s “closeness” to the target variable. Instead of using entropy as a measure of disorder (or order) with regards to the categorical target values, we use variance as a measure of closeness to the target value.

A feature will have a high variance if its values do not behave as the target values behave. Low variance occurs when the feature increases (decreases) in value, the target also increases (decrease) in value. In other words, the counterpart of information gain in regression problems is correlation.

Here’s a table outlining which quantities are analogous to each other between classification and regression problems.

  Measure of Orderness Measure of Informativeness
Classification Entropy Information Gain
Regression Variance Correlation

Hope this was insightful!

Machine Learning: Entropy and Classification - April 2, 2017 - Adler Santos