Entropy & Information Gain Explained

akhil anand

What is Entropy?

Entropy is used to measure the impurity of a dataset. The most impure node (high entropy) is taken as the root node, while the purest nodes (low entropy) end up as leaf nodes.

Entropy formula: H(Y) = -Σ P(Y=i) * log2(P(Y=i)), summed over all outcomes i

Below are the data points that indicate whether a person will play cricket or not.

Table 1 (8 data points: 5 with outcome 1, i.e. the person plays cricket, and 3 with outcome 0)

P(Y|1) = (number of data points where the person plays cricket) / (total number of data points)

Why Negative sign in Entropy formula?

Probabilities lie between 0 and 1, so the log of a probability is always zero or negative. Without the negative sign the whole sum would be negative, and a highly impure distribution would end up with a more negative (i.e. smaller) value, making it look less impure. Putting the negative sign in front flips this, so that highly impure data gets a high, positive entropy. The log in the entropy formula is log with base 2.

Let’s calculate the entropy of the above data points:

P(Y|1) = 5/8,
P(Y|0) = 3/8
H(Y) = -{(5/8)*log2(5/8) + (3/8)*log2(3/8)}
H(Y) = -{-0.424 - 0.531}
H(Y) = 0.954
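A minimal Python sketch (the entropy helper below is my own, not from the post) can be used to double-check this number:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) from a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 8 data points: 5 "will play cricket", 3 "will not"
print(round(entropy([5, 3]), 3))  # ~0.954
```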
Graphical Representation of Entropy

Entropy is symmetric in nature: the value we get at 0.1 is the same as at 0.9. Let’s understand the above figure with a mathematical interpretation.

Suppose D represents the overall dataset, D1 represents the data points with outcome 1, and D2 represents the data points with outcome 0.

Case 1 :- When 90% of the data points belong to dataset D1 and 10% of the data points belong to dataset D2, the entropy would be:

H(Y) = -{P(Y|D1)*log2(P(Y|D1)) + P(Y|D2)*log2(P(Y|D2))}
H(Y) = -{(0.90)*log2(0.90) + (0.10)*log2(0.10)}
H(Y) = -{-0.137 - 0.332}
H(Y) = 0.469

Case 2 :- When 50% of the data points belong to dataset D1 and 50% of the data points belong to dataset D2, the entropy would be:

H(Y) = -{P(Y|D1)*log2(P(Y|D1)) + P(Y|D2)*log2(P(Y|D2))}
H(Y) = -{(0.50)*log2(0.50) + (0.50)*log2(0.50)}
H(Y) = -{-0.5 - 0.5}
H(Y) = 1

Case 3 :- When 100% of the data points belong to dataset D1 and 0% of the data points belong to dataset D2, the entropy would be:

H(Y) = -{P(Y|D1)*log2(P(Y|D1)) + P(Y|D2)*log2(P(Y|D2))}
H(Y) = -{(1)*log2(1) + (0)*log2(0)}
H(Y) = -{-0 - 0}      (0*log2(0) is taken as 0 by convention)
H(Y) = 0

An equiprobable (50/50) dataset is the most impure, and splitting such a dataset gives us the most information to gain.
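The three cases can be reproduced with a short sketch (the binary_entropy helper is mine, for illustration); the output shows the symmetry around p = 0.5 and the maximum of 1 at the 50/50 split:

```python
import math

def binary_entropy(p):
    """Entropy of a two-class dataset where one class has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # 0*log2(0) is treated as 0 by convention
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
# prints: 0.0 -> 0.0, 0.1 -> 0.469, 0.5 -> 1.0, 0.9 -> 0.469, 1.0 -> 0.0
```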

Information Gain

Dataset (14 data points with an Outlook feature taking the values Sunny, Overcast or Rainy, and a Yes/No outcome: 9 Yes, 5 No)

What is Information Gain?

Suppose someone gives you a sales dataset and asks you to measure customer retention. How would you find it?

  1. First of all, you check how many people have actually purchased the product.
  2. If they have purchased the product more than once, you can say those people have been retained.

Information gain works the same way in a decision tree. It breaks the data/problem into smaller and smaller parts; the more the data is broken down, the more information is obtained and the more intuitive the decision tree becomes. Hence, with the help of information gain we get deeper and deeper insight into the data, and the predictions become more generalized.

Formulation

Suppose a dataset D has been broken into a number of smaller datasets, i.e.:

D -----> D1, D2, D3, D4, ..., Dk

then the information gain is defined as:

Information Gain: IG = H(D) - Σ (|Di|/|D|) * H(Di), summed over i = 1, ..., k

where H(D) is the entropy of the whole dataset D, H(Di) is the entropy of the i-th smaller dataset, and |Di|/|D| is the fraction of data points that fall into Di.
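Here is a minimal sketch of that formula in Python (the function names are my own); it takes class counts for the parent dataset and for each child dataset D1..Dk:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) from a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """IG = H(D) - sum_i (|Di| / |D|) * H(Di)."""
    total = sum(parent_counts)
    weighted = sum((sum(child) / total) * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted
```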

Mathematical Calculation

H(D) :- Entropy of overall dataset D.

The above dataset has 14 data points in total, of which 5 have the outcome No and 9 have the outcome Yes, so the entropy of the overall dataset would be:

H(D) = -[{P(Yes)*log2(P(Yes))} + {P(No)*log2(P(No))}]
H(D) = -[{(9/14)*log2(9/14)} + {(5/14)*log2(5/14)}]
H(D) = -[-0.410 - 0.531] = 0.94

Now we will calculate the information gain after splitting the original dataset into smaller datasets D1, D2 & D3, where the three datasets correspond to the different outlook conditions, i.e. Sunny, Overcast & Rainy. D1 (Sunny) and D3 (Rainy) each contain 5 data points and have an entropy of 0.97, while D2 (Overcast) contains 4 data points that all share the same outcome, so its entropy is 0. The information gain would be:

IG = 0.94 - [{(5/14)*0.97} + {(4/14)*(0)} + {(5/14)*0.97}]
IG = 0.247
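That value can be checked with a short, self-contained sketch; the per-outlook class counts below are assumed from the standard play-tennis example and are consistent with the 0.97 / 0 / 0.97 entropies used above:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [9, 5]                      # whole dataset: 9 Yes, 5 No -> H = 0.94
children = [[2, 3], [4, 0], [3, 2]]  # Sunny, Overcast, Rainy (assumed counts)

weighted = sum((sum(c) / sum(parent)) * entropy(c) for c in children)
print(round(entropy(parent) - weighted, 3))  # ~0.247
```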

With the help of information gain we grow the decision tree, choosing the split with the highest gain at each step, until we reach pure nodes.

Conclusion :-

Please comment below if you have any suggestions regarding this blog. Keep learning, keep growing.
