Mutual information and Kullback-Leibler divergence

Theoretically the most satisfying contrast function in the multi-unit case is, in our view, mutual information.

Using the concept of differential entropy defined in Eq. (6), one defines the mutual information I between m (scalar) random variables y_i, i=1...m, as follows

$\begin{displaymath} I(y_1,y_2,...,y_m)=\sum_i H(y_i)-H({\bf y}). \end{displaymath}$

(15)

where H denotes differential entropy. Mutual information is a natural measure of the dependence between random variables. It is always non-negative, and it is zero if and only if the variables are statistically independent. Mutual information thus takes into account the whole dependence structure of the variables. Finding a transform that minimizes the mutual information between the components s_i is a very natural way of estimating the ICA model [36]. This approach also gives a method of performing ICA according to the general Definition 1 in Section 3. For future use, note that the differential entropy of an invertible linear transformation ${\bf y}={\bf W}{\bf x}$ satisfies $H({\bf y})=H({\bf x})+\log\vert\det {\bf W}\vert$ , so that (15) becomes:

$\begin{displaymath} I(y_1,y_2,...,y_m)=\sum_i H(y_i)-H({\bf x})-\log\vert\det {\bf W}\vert. \end{displaymath}$

(16)
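As a sanity check of (15) and (16), one can look at the special case of jointly Gaussian variables, for which the differential entropy has the closed form H = (1/2) log((2 pi e)^m det Sigma). The following Python/NumPy sketch is only an illustration under that Gaussian assumption; the function names, the covariance matrix and the matrix W are made up for the example and do not come from the text.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a zero-mean Gaussian with covariance `cov`."""
    cov = np.atleast_2d(cov)
    m = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (m * np.log(2.0 * np.pi * np.e) + logdet)

def gaussian_mi(cov):
    """Eq. (15) for Gaussian variables: sum of marginal entropies minus joint entropy."""
    marginal_sum = sum(gaussian_entropy(v) for v in np.diag(cov))
    return marginal_sum - gaussian_entropy(cov)

# Hypothetical covariance of x and hypothetical invertible transform W.
cov_x = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
W = np.array([[1.0, 0.5],
              [-0.3, 2.0]])

# Covariance of y = W x, and mutual information computed directly from (15).
cov_y = W @ cov_x @ W.T
mi_direct = gaussian_mi(cov_y)

# The same quantity via (16): only the marginal entropies of y depend on the
# data after the transform; H(x) is fixed and log|det W| is known.
marginal_sum_y = sum(gaussian_entropy(v) for v in np.diag(cov_y))
mi_eq16 = marginal_sum_y - gaussian_entropy(cov_x) - np.log(abs(np.linalg.det(W)))

# Independent Gaussian components (diagonal covariance) give zero mutual information.
mi_independent = gaussian_mi(np.diag([2.0, 1.0]))

print(mi_direct, mi_eq16, mi_independent)
```

The two routes agree up to rounding error. Form (16) is convenient in practice because H(x) does not depend on W, so only the marginal entropies and the determinant term change during optimization.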

The use of mutual information can also be motivated using the Kullback-Leibler divergence, defined for two probability densities f₁ and f₂ as

$\begin{displaymath}\delta(f_1,f_2)=\int f_1({\bf y}) \log \frac{f_1({\bf y})}{f_2({\bf y})} d{\bf y}. \end{displaymath}$

(17)

The Kullback-Leibler divergence can be considered as a kind of distance between the two probability densities, though it is not a true distance measure because it is not symmetric. Now, if the y_i in (15) were independent, their joint probability density could be factorized as in the definition of independence in Eq. (7). Thus one might measure the dependence of the y_i by the Kullback-Leibler divergence between the real density $f({\bf y})$ and the factorized density $\tilde{f}({\bf y})=f_1(y_1)f_2(y_2)...f_m(y_m)$ , where the f_i(.) are the marginal densities of the y_i. In fact, this quantity equals the mutual information of the y_i.
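The equality between this divergence and the mutual information can be checked numerically. The sketch below uses a discrete analogue, in which sums replace the integrals and Shannon entropy replaces differential entropy; this substitution, the function names and the joint probability table are assumptions made for illustration only.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability table, skipping zero cells."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """Discrete analogue of Eq. (17): sum of p * log(p / q) over cells with p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Hypothetical joint distribution of two dependent variables
# (rows index y_1, columns index y_2).
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])

# Marginal distributions and the factorized distribution built from them.
p1 = joint.sum(axis=1)
p2 = joint.sum(axis=0)
factorized = np.outer(p1, p2)

# Mutual information as in (15), and as a divergence from the factorized table.
mi_from_entropies = entropy(p1) + entropy(p2) - entropy(joint)
mi_from_kl = kl_divergence(joint, factorized)

# Swapping the arguments gives a different value: the divergence is not symmetric.
kl_reversed = kl_divergence(factorized, joint)

print(mi_from_entropies, mi_from_kl, kl_reversed)
```

The first two printed values coincide, while the third differs, which illustrates the lack of symmetry mentioned above.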

The Kullback-Leibler divergence also reveals the close relationship between minimizing mutual information and maximizing likelihood. In fact, the likelihood can be represented as a Kullback-Leibler divergence between the observed density and the factorized density assumed in the model [26]. Both methods thus minimize the Kullback-Leibler divergence between the observed density and a factorized density; the two factorized densities are asymptotically equivalent if the density is accurately estimated as part of the ML estimation method.

The problem with mutual information is that it is difficult to estimate. As was mentioned in Section 2, to use the definition of entropy, one needs an estimate of the density. This problem has severely restricted the use of mutual information in ICA estimation. Some authors have used approximations of mutual information based on polynomial density expansions [36,1], which lead to the use of higher-order cumulants (for definition of cumulants, see Appendix A).
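Part of the appeal of cumulants is that they can be estimated from sample moments, with no density estimate required. The following Python/NumPy sketch assumes the standard definitions for a standardized variable z (zero mean, unit variance), namely kappa_3 = E[z^3] and kappa_4 = E[z^4] - 3; the function name, the sample size and the choice of test distributions are illustrative assumptions.

```python
import numpy as np

def sample_cumulants(y):
    """Estimate (kappa_3, kappa_4) from samples, after standardizing y."""
    y = np.asarray(y, dtype=float)
    z = (y - y.mean()) / y.std()
    kappa3 = np.mean(z**3)          # third cumulant (skewness)
    kappa4 = np.mean(z**4) - 3.0    # fourth cumulant (excess kurtosis)
    return kappa3, kappa4

rng = np.random.default_rng(0)
n = 200_000

# Gaussian: both cumulants are zero in theory.
k3_gauss, k4_gauss = sample_cumulants(rng.standard_normal(n))

# Laplace (supergaussian): theoretical kappa_4 is 3.
k3_lap, k4_lap = sample_cumulants(rng.laplace(size=n))

# Uniform (subgaussian): theoretical kappa_4 is -1.2.
k3_unif, k4_unif = sample_cumulants(rng.uniform(-1.0, 1.0, size=n))

print(k4_gauss, k4_lap, k4_unif)
```

The estimates fluctuate around the theoretical values, and the fluctuation is larger for heavy-tailed data such as the Laplace samples, since the fourth sample moment is sensitive to large observations.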

The polynomial density expansions are related to the Taylor expansion. They give an approximation of a probability density f(.) of a scalar random variable y using its higher-order cumulants. For example, the first terms of the Edgeworth expansion give, for a scalar random variable y of zero mean and unit variance [88]:

$\begin{displaymath}f(\xi)\approx \varphi(\xi)(1+\kappa_3(y) h_3(\xi)/6+\kappa_4(y) h_4(\xi)/24+...) \end{displaymath}$

(18)

where $\varphi$ is the density function of a standardized Gaussian random variable, the $\kappa_i(y)$ are the cumulants of the random variable y (see Appendix A), and h_i(.) are certain polynomial functions (Hermite polynomials). Using such expansions, one obtains for example the following approximation for mutual information

$\begin{displaymath} I({\bf y})\approx C+\frac{1}{48}\sum_{i=1}^m [4\kappa_3(y_i)^2+\kappa_4(y_i)^2+7\kappa_4(y_i)^4-6\kappa_3(y_i)^2\kappa_4(y_i)] \end{displaymath}$

(19)

where C is a constant; the y_i are here constrained to be uncorrelated. A very similar approximation was derived in [1], and also earlier in the context of projection pursuit in [78].
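To make the truncated expansion (18) concrete, the sketch below evaluates it on a grid. It assumes that h_3 and h_4 are the probabilists' Hermite polynomials, h_3(xi) = xi^3 - 3 xi and h_4(xi) = xi^4 - 6 xi^2 + 3; the function name, the grid and the cumulant values are hypothetical choices for illustration.

```python
import numpy as np

def edgeworth_density(xi, kappa3, kappa4):
    """First terms of expansion (18) for a zero-mean, unit-variance variable."""
    xi = np.asarray(xi, dtype=float)
    phi = np.exp(-0.5 * xi**2) / np.sqrt(2.0 * np.pi)  # standard Gaussian density
    h3 = xi**3 - 3.0 * xi                              # assumed Hermite polynomial h_3
    h4 = xi**4 - 6.0 * xi**2 + 3.0                     # assumed Hermite polynomial h_4
    return phi * (1.0 + kappa3 * h3 / 6.0 + kappa4 * h4 / 24.0)

# Fine grid; the Gaussian factor makes the tails beyond |xi| = 12 negligible.
xi = np.linspace(-12.0, 12.0, 240_001)
dxi = xi[1] - xi[0]

# Mildly nongaussian case: integrate numerically with a simple Riemann sum.
f_mild = edgeworth_density(xi, kappa3=0.3, kappa4=0.5)
total_mass = np.sum(f_mild) * dxi
third_moment = np.sum(xi**3 * f_mild) * dxi
fourth_moment = np.sum(xi**4 * f_mild) * dxi

# Strongly nongaussian case: the truncated series dips below zero.
f_strong = edgeworth_density(xi, kappa3=0.0, kappa4=6.0)
min_strong = f_strong.min()

print(total_mass, third_moment, fourth_moment, min_strong)
```

For the mild cumulant values the approximation integrates to one and reproduces the prescribed third moment (kappa_3) and fourth moment (3 + kappa_4). For the large fourth cumulant the truncated series takes negative values, so it is no longer a valid density; this is one way to see why such expansions are trustworthy only near the Gaussian case.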

Cumulant-based approximations such as the one in (19) simplify the use of mutual information considerably. The approximation is valid, however, only when f(.) is not far from the Gaussian density function, and may produce poor results when this is not the case. More sophisticated approximations of mutual information can be constructed by using the approximations of differential entropy that were introduced in [64], based on the maximum entropy principle. In these approximations, the cumulants are replaced by more general measures of nongaussianity, see Section 4.4.3 and Section 4.4.1. Minimization of such an approximation of mutual information was introduced in [65].

Aapo Hyvarinen
1999-04-23