Using the concept of differential entropy, we define
the mutual information *I* between *m* (scalar) random variables
*y*_{i}, *i* = 1, ..., *m*, as follows:

*I*(*y*_{1}, *y*_{2}, ..., *y*_{m}) = ∑_{i=1}^{m} *H*(*y*_{i}) − *H*(**y**).

Mutual information is a natural measure of the dependence between random variables. In fact, it is equivalent to the well-known Kullback-Leibler divergence between the joint density and the product of its marginal densities, which is itself a very natural measure of independence. It is always non-negative, and zero if and only if the variables are statistically independent. Thus, mutual information takes into account the whole dependence structure of the variables, and not only the covariance, as is the case for PCA and related methods.
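As a concrete illustration, this divergence can be computed for a small discrete joint distribution; the sketch below (the 2×2 table and all its numbers are our own, purely illustrative) checks both the non-negativity and the independent case:

```python
import numpy as np

# Illustrative 2x2 joint distribution of two binary variables y1, y2
# (these numbers are an assumption, not taken from the text).
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
p1 = p_joint.sum(axis=1)   # marginal density of y1
p2 = p_joint.sum(axis=0)   # marginal density of y2

# Mutual information as the Kullback-Leibler divergence between the
# joint density and the product of its marginals:
#   I = sum p(y1, y2) * log[ p(y1, y2) / (p(y1) * p(y2)) ]
prod = np.outer(p1, p2)
I = np.sum(p_joint * np.log(p_joint / prod))
print(I)          # strictly positive: the variables are dependent

# If the joint IS the product of the marginals (independence), I vanishes.
I_indep = np.sum(prod * np.log(prod / prod))
print(I_indep)    # 0.0
```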

Mutual information can be interpreted by using the interpretation of
entropy as code length. The terms *H*(*y*_{i}) give the lengths of codes
for the *y*_{i} when these are coded separately, and *H*(**y**) gives the
code length when **y** is coded as a random vector, i.e. all the
components are coded in the same code. Mutual information thus shows
what code length reduction is obtained by coding the whole vector
instead of the separate components. In general, better codes can be
obtained by coding the whole vector. However, if the *y*_{i} are independent,
they give no information on each other, and one could just as well
code the variables separately without increasing code length.
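The code-length reading can be made concrete with a small toy distribution (the numbers below are our own illustration): the sum of the marginal entropies is the cost of separate codes, the joint entropy is the cost of coding the vector, and the reduction is exactly the mutual information.

```python
import numpy as np

# Illustrative joint distribution of two binary variables
# (an assumption for the sketch, not from the text).
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
p1 = p_joint.sum(axis=1)
p2 = p_joint.sum(axis=0)

def H(p):
    """Shannon entropy in bits, i.e. an optimal average code length."""
    return -np.sum(p * np.log2(p))

separate = H(p1) + H(p2)        # coding y1 and y2 with separate codes
joint = H(p_joint.ravel())      # coding the pair (y1, y2) as one symbol
print(separate, joint)          # the joint code is shorter
print(separate - joint)         # the saving equals the mutual information
```

For this table the saving is about 0.28 bits per pair; for an independent joint distribution the two costs coincide and the saving is zero.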

An important property of mutual information [36,8] is that we
have, for an invertible linear transformation **y** = **Wx**:

*I*(*y*_{1}, *y*_{2}, ..., *y*_{m}) = ∑_{i} *H*(*y*_{i}) − *H*(**x**) − log |det **W**|.
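This rests on the identity *H*(**Wx**) = *H*(**x**) + log |det **W**|, which can be sanity-checked numerically. A minimal sketch, assuming Gaussian variables so that the differential entropy has the closed form *H* = ½ log((2π*e*)^{n} det Σ) (the Gaussian choice, the dimension and the random seed are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
Sigma_x = A @ A.T + n * np.eye(n)   # a valid (positive definite) covariance
W = rng.standard_normal((n, n))     # almost surely invertible

def gauss_entropy(Sigma):
    """Differential entropy of a Gaussian: 0.5 * log((2*pi*e)^n * det(Sigma))."""
    m = Sigma.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** m * np.linalg.det(Sigma))

H_x = gauss_entropy(Sigma_x)
H_y = gauss_entropy(W @ Sigma_x @ W.T)   # covariance of y = W x

# The two quantities below agree to numerical precision.
print(H_y - H_x, np.log(abs(np.linalg.det(W))))
```

The identity itself holds for any density, not just Gaussians; the Gaussian case merely makes both sides computable in closed form.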

Now, let us consider what happens if we constrain the *y*_{i} to be *uncorrelated* and of *unit variance*. This means E{**y** **y**^{T}} = **W** E{**x** **x**^{T}} **W**^{T} = **I**, which implies

det **I** = 1 = det(**W** E{**x** **x**^{T}} **W**^{T}) = (det **W**)(det E{**x** **x**^{T}})(det **W**^{T}),

and this implies that det **W** must be constant. Moreover, for *y*_{i} of unit variance, entropy and negentropy differ only by a constant and the sign. Thus we obtain

*I*(*y*_{1}, *y*_{2}, ..., *y*_{m}) = *C* − ∑_{i} *J*(*y*_{i}),

where