Mutual information

In probability theory and information theory, the mutual information of two random variables is a quantity that measures the independence of the two variables. The unit of measurement of mutual information is the bit.

Informally, mutual information measures the information of X that is shared by Y. If X and Y are independent, then X contains no information about Y and vice versa, so their mutual information is zero. If X and Y are identical then all information conveyed by X is shared with Y: knowing X reveals nothing new about Y and vice versa, therefore the mutual information is the same as the information conveyed by X (or Y) alone, namely the entropy of X. In a specific sense (see below), mutual information quantifies the distance between the joint distribution of X and Y and the product of their marginal distributions.

Formally, in the discrete case, if the joint probability density function (with respect to counting measure, since this is the discrete case) of X and Y is p with p(x, y) = Pr(X=x, Y=y), the probability density function of X alone is f with f(x) = Pr(X=x), and the probability density function of Y alone is g with g(y) = Pr(Y=y), then the mutual information of X and Y is given by I(X, Y), defined as follows for the discrete case:

I(X,Y) = \sum_{x,y} p(x,y) \times \log_2 \frac{p(x,y)}{f(x)\,g(y)}. \!

In the continuous case, replace summation by a definite double integral:

I(X,Y) = \int_{(-\infty,\infty) \times (-\infty,\infty)} p(x,y) \times \log_2 \frac{p(x,y)}{f(x)\,g(y)} \; d(x,y). \!

Mutual information is a measure of independence in the following sense: I(X, Y) = 0 iff X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = f(x) × g(y), and therefore:

\log \frac{p(x,y)}{f(x)\,g(y)} = \log 1 = 0. \!

Moreover, mutual information is nonnegative (i.e. I(X,Y) ≥ 0; see below) and symmetric (i.e. I(X,Y) = I(Y,X)).

Several generalizations of mutual information to more than two random variables have been proposed, but a widely agreed on definition has not yet emerged.

Relation to other quantities

Mutual information can be equivalently expressed as

I(X,Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X,Y)

where H(X) and H(Y) are entropies, H(X|Y) and H(Y|X) are conditional entropies, and H(Y,X) is the joint entropy between X and Y.

Since H(X) > H(X|Y), this characterization is consistent with the nonnegativity property stated above.

Mutual information can also be expressed in terms of the Kullback-Leibler divergence between the joint distribution of two random variables X and Y and the product of their marginal distributions. Let q(x, y) = f(x) × g(y); then

I(X,Y) = KL(p,q).

Furthermore, let hy(x) = p(x, y) / g(y). Then

I(X,Y) = \sum_y g(y) \sum_x h_y(x) \times \log_2 \frac{h_y(x)}{f(x)} \!
= \sum_y g(y) \; \mathit{KL}(h_y,f) \!
= \mathrm{E}_Y[\mathit{KL}(h_y,f)]. \!

Thus mutual information can also be understood as the expectation of the Kullback-Leibler divergence between the conditional distribution h of X given Y and the univariate distribution f of X: the more different the distributions f and h, the greater the information gain.

Applications of mutual information

In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:

References

See also: Mutual information, Athanasios Papoulis, Bit, Collocation, Conditional distribution, Conditional entropy, Continuous, Coordinate system, Corpus linguistics, Discrete