
Mutual Information

Using the concept of differential entropy, we define the mutual information $I$ between $m$ (scalar) random variables $y_i$, $i=1,\ldots,m$, as follows:

\begin{displaymath}
I(y_1,y_2,\ldots,y_m)=\sum_{i=1}^m H(y_i)-H({\bf y}). \qquad (24)
\end{displaymath}

Mutual information is a natural measure of the dependence between random variables. In fact, it equals the well-known Kullback-Leibler divergence between the joint density $f({\bf y})$ and the product of its marginal densities, which is itself a very natural measure of independence. Mutual information is always non-negative, and it is zero if and only if the variables are statistically independent. It therefore takes into account the whole dependence structure of the variables, and not only the covariance, as PCA and related methods do.
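
As a concrete numerical illustration of definition (24), the following short Python sketch computes the mutual information of two discrete random variables from a joint probability table. Discrete (Shannon) entropies stand in here for the differential entropies used above, and the function names and example distributions are purely illustrative, not taken from the text:

    import numpy as np

    def entropy(p):
        # Shannon entropy (in bits) of a probability vector; zero entries are skipped.
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def mutual_information(joint):
        # I(y1,y2) = H(y1) + H(y2) - H(y1,y2), cf. Eq. (24), for a 2-D joint table.
        joint = np.asarray(joint, dtype=float)
        return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

    # Dependent variables: y2 tends to copy y1, so the mutual information is positive.
    dependent = [[0.4, 0.1],
                 [0.1, 0.4]]
    # Independent variables: the joint table is the product of its marginals, so I = 0.
    independent = np.outer([0.5, 0.5], [0.5, 0.5])

    print(mutual_information(dependent))    # approximately 0.28 bits
    print(mutual_information(independent))  # 0.0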

Mutual information can also be understood through the interpretation of entropy as code length. The terms $H(y_i)$ give the lengths of the codes for the $y_i$ when these are coded separately, and $H({\bf y})$ gives the code length when ${\bf y}$ is coded as a random vector, i.e., when all the components are coded with the same code. Mutual information thus shows what reduction in code length is obtained by coding the whole vector instead of the separate components. In general, better codes can be obtained by coding the whole vector. If the $y_i$ are independent, however, they give no information on each other, and one could just as well code the variables separately without increasing the code length.

An important property of mutual information [36,8] is that, for an invertible linear transformation ${\bf y}={\bf W}{\bf x}$, we have:

\begin{displaymath}
I(y_1,y_2,\ldots,y_n)=\sum_i H(y_i)-H({\bf x})-\log\vert\det {\bf W}\vert. \qquad (25)
\end{displaymath}
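
The identity (25) can be spelled out with one intermediate step, namely the standard transformation formula for the differential entropy of an invertible linear transformation (the step below is a sketch of the usual argument rather than a quotation from the text):

\begin{displaymath}
H({\bf y})=H({\bf W}{\bf x})=H({\bf x})+\log\vert\det {\bf W}\vert,
\end{displaymath}

and substituting this into definition (24) immediately gives (25).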

Now, let us consider what happens if we constrain the $y_i$ to be uncorrelated and of unit variance. This means $E\{{\bf y}{\bf y}^T\}={\bf W}E\{{\bf x}{\bf x}^T\}{\bf W}^T={\bf I}$, which implies $\det {\bf I}=1=\det({\bf W}E\{{\bf x}{\bf x}^T\}{\bf W}^T)=(\det {\bf W})(\det E\{{\bf x}{\bf x}^T\})(\det {\bf W}^T)$, and this in turn implies that $\det {\bf W}$ must be constant. Moreover, for $y_i$ of unit variance, entropy and negentropy differ only by a constant and a sign. Thus we obtain

\begin{displaymath}
I(y_1,y_2,\ldots,y_n)=C-\sum_i J(y_i), \qquad (26)
\end{displaymath}

where $C$ is a constant that does not depend on ${\bf W}$. This shows the fundamental relation between negentropy and mutual information.
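
To make the step from (25) to (26) explicit, note that for a $y_i$ of unit variance the negentropy can be written as $J(y_i)=H(\nu)-H(y_i)$, where $\nu$ denotes a standardized Gaussian variable (this auxiliary notation is introduced here only for the sketch). Substituting into (25) gives

\begin{displaymath}
I(y_1,\ldots,y_n)=\sum_i \big[H(\nu)-J(y_i)\big]-H({\bf x})-\log\vert\det {\bf W}\vert
=C-\sum_i J(y_i),
\end{displaymath}

where all the terms that do not depend on the distributions of the $y_i$, including $\log\vert\det {\bf W}\vert$, which was shown above to be constant, have been absorbed into $C$.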

