In the preceding sections, we introduced new contrast (or objective) functions for ICA based on minimization of mutual information (and projection pursuit), analyzed some of their properties, and gave guidelines for the practical choice of the function G used in the contrast functions. In practice, one also needs an algorithm for maximizing the contrast functions in (7) or (8).
A simple method to maximize the contrast function would be to use stochastic gradient descent; the constraint could be taken into account by a bigradient feedback. This leads to neural (adaptive) algorithms that are closely related to those introduced in [24]. We show in Appendix B how to modify the algorithms in [24] to minimize the contrast functions used in this paper.
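As a rough illustration of such an adaptive scheme (not the exact algorithms of [24] or Appendix B), the sketch below takes stochastic gradient steps on a one-unit contrast of the form $E\{G(\mathbf{w}^T\mathbf{x})\}$, with the unit-norm constraint enforced by explicit renormalization instead of bigradient feedback; the function name, the choice $g = G' = \tanh$, and the whitened input stream are assumptions made only for illustration.

```python
import numpy as np

# Illustrative on-line (adaptive) one-unit update: a stochastic gradient
# step on E{G(w^T x)} with g = G', followed by renormalization of w.
# This is a simplified stand-in for the bigradient-constrained algorithms
# referred to in the text, not a reproduction of them.
def online_ica_unit(stream, n_dim, g=np.tanh, mu=0.01, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n_dim)
    w /= np.linalg.norm(w)
    for x in stream:              # x: one whitened observation at a time
        y = w @ x                 # current output y = w^T x
        w += mu * g(y) * x        # stochastic gradient step on E{G(w^T x)}
        w /= np.linalg.norm(w)    # keep w on the unit sphere
    return w
```

Depending on whether the contrast is to be maximized or minimized for the source in question, the sign of the gradient step may need to be reversed.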
The advantage of neural on-line learning rules is that the inputs can be used in the algorithm at once, thus enabling faster adaptation in a non-stationary environment. The resulting trade-off, however, is that convergence is slow and depends on a good choice of the learning rate sequence, i.e., the step size at each iteration. A bad choice of the learning rate can, in practice, destroy convergence. It is therefore important in practice to make the learning faster and more reliable.
This can be achieved by the fixed-point iteration algorithms that we introduce here. In the fixed-point algorithms, the computations are made in batch (or block) mode, i.e., a large number of data points are used in a single step of the algorithm. In other respects, however, the algorithms may be considered neural. In particular, they are parallel, distributed, computationally simple, and require little memory space. We will show below that the fixed-point algorithms have very appealing convergence properties, making them a very interesting alternative to adaptive learning rules in environments where fast real-time adaptation is not necessary.
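To make the batch-mode idea concrete before the derivations that follow, here is a minimal one-unit sketch of a fixed-point iteration of the FastICA type. The update $\mathbf{w}^+ = E\{\mathbf{x}\, g(\mathbf{w}^T\mathbf{x})\} - E\{g'(\mathbf{w}^T\mathbf{x})\}\,\mathbf{w}$ followed by normalization is the standard one-unit fixed-point rule of this kind and is derived later in the paper; it is shown here only to illustrate how all data points enter a single step, and the function name, data layout, and choice $g = \tanh$ are assumptions.

```python
import numpy as np

# Batch-mode one-unit fixed-point iteration: the expectations E{x g(w^T x)}
# and E{g'(w^T x)} are estimated over the whole block of whitened data X
# (shape: dim x samples) in every step.
def fixed_point_unit(X, g=np.tanh, dg=lambda y: 1.0 - np.tanh(y) ** 2,
                     max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = w @ X                                          # w^T x for all samples
        w_new = X @ g(y) / X.shape[1] - dg(y).mean() * w   # batch expectations
        w_new /= np.linalg.norm(w_new)                     # renormalize to the unit sphere
        if abs(abs(w_new @ w) - 1.0) < tol:                # converged up to sign
            return w_new
        w = w_new
    return w
```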
Note that our basic ICA algorithms require a preliminary sphering or whitening of the data, though some versions for non-sphered data will also be given. Sphering means that the original observed variable, say $\mathbf{v}$, is linearly transformed to a variable $\mathbf{x} = \mathbf{Q}\mathbf{v}$ such that the correlation matrix of $\mathbf{x}$ equals unity: $E\{\mathbf{x}\mathbf{x}^T\} = \mathbf{I}$.
This transformation is always possible; indeed, it can
be accomplished by classical PCA.
For details, see [7,12].
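As a concrete sketch of this preprocessing step (the function name and data layout are assumptions), the whitening matrix can be obtained from the eigendecomposition $\mathbf{C} = \mathbf{E}\mathbf{D}\mathbf{E}^T$ of the data covariance as $\mathbf{Q} = \mathbf{D}^{-1/2}\mathbf{E}^T$, which is the PCA-based whitening mentioned above:

```python
import numpy as np

# Sphering (whitening) via PCA: given observations v as the columns of V,
# compute Q = D^{-1/2} E^T from the eigendecomposition C = E D E^T of the
# covariance, so that x = Q v has unit correlation matrix E{x x^T} = I.
def whiten(V):
    Vc = V - V.mean(axis=1, keepdims=True)   # center the data
    C = np.cov(Vc)                           # sample covariance matrix
    d, E = np.linalg.eigh(C)                 # C = E diag(d) E^T
    Q = np.diag(1.0 / np.sqrt(d)) @ E.T      # whitening matrix
    X = Q @ Vc                               # whitened data
    return X, Q
```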