Intuitively speaking, the key to estimating the ICA model is nongaussianity. Actually, without nongaussianity the estimation is not possible at all, as mentioned in Sec. 3.3. This is at the same time probably the main reason for the rather late resurgence of ICA research: In most of classical statistical theory, random variables are assumed to have gaussian distributions, thus precluding any methods related to ICA.
The Central Limit Theorem, a classical result in probability theory, states that the distribution of a sum of independent random variables tends toward a gaussian distribution, under certain conditions. Thus, a sum of two independent random variables usually has a distribution that is closer to gaussian than either of the two original random variables.
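The effect can be checked numerically. The following is a minimal sketch, assuming excess kurtosis (which is zero for a gaussian variable) as a rough yardstick of distance from gaussianity, and assuming uniformly distributed variables; both choices are illustrative and not part of the argument above.

```python
import random


def excess_kurtosis(samples):
    """Sample excess kurtosis: E[(x - m)^4] / var^2 - 3 (zero for a gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    var = sum(c * c for c in centered) / n
    fourth = sum(c ** 4 for c in centered) / n
    return fourth / (var * var) - 3.0


rng = random.Random(0)  # fixed seed, for reproducibility only
n_samples = 100_000

# Two independent, identically distributed, clearly nongaussian variables
# (uniform on [-1, 1] is an assumption made for this illustration).
s1 = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
s2 = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
mixed = [a + b for a, b in zip(s1, s2)]

k_single = excess_kurtosis(s1)  # one original variable
k_sum = excess_kurtosis(mixed)  # sum of the two variables
print("single variable:", k_single)
print("sum of two:     ", k_sum)
```

Under these assumptions the theoretical values are -1.2 for a single uniform variable and -0.6 for the sum of two, so the sum lies closer to the gaussian value of zero.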
Let us now assume that the data vector $\mathbf{x}$ is distributed according to the ICA data model in Eq. 4, i.e. it is a mixture of independent components. For simplicity, let us assume in this section that all the independent components have identical distributions. To estimate one of the independent components, we consider a linear combination of the $x_i$ (see Eq. 6); let us denote this by $y = \mathbf{w}^T \mathbf{x} = \sum_i w_i x_i$, where $\mathbf{w}$ is a vector to be determined. If $\mathbf{w}$ were one of the rows of the inverse of $\mathbf{A}$, this linear combination would actually equal one of the independent components. The question is now: How could we use the Central Limit Theorem to determine $\mathbf{w}$ so that it would equal one of the rows of the inverse of $\mathbf{A}$? In practice, we cannot determine such a $\mathbf{w}$ exactly, because we have no knowledge of matrix $\mathbf{A}$, but we can find an estimator that gives a good approximation.
To see how this leads to the basic principle of ICA estimation, let us make a change of variables, defining $\mathbf{z} = \mathbf{A}^T \mathbf{w}$. Then we have $y = \mathbf{w}^T \mathbf{x} = \mathbf{w}^T \mathbf{A} \mathbf{s} = \mathbf{z}^T \mathbf{s}$. Thus $y$ is a linear combination of the $s_i$, with weights given by the $z_i$. Since a sum of even two independent random variables is more gaussian than the original variables, $\mathbf{z}^T \mathbf{s}$ is more gaussian than any of the $s_i$ and becomes least gaussian when it in fact equals one of the $s_i$. In this case, obviously only one of the elements $z_i$ of $\mathbf{z}$ is nonzero. (Note that the $s_i$ were here assumed to have identical distributions.)
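As a worked illustration of this step, suppose (an assumption not made in the text above) that nongaussianity is measured by the absolute value of kurtosis, $\operatorname{kurt}(y) = E\{y^4\} - 3(E\{y^2\})^2$, and that the $s_i$ have zero mean, unit variance and a common kurtosis $\kappa \neq 0$. Kurtosis is additive over independent variables and satisfies $\operatorname{kurt}(\alpha s) = \alpha^4 \operatorname{kurt}(s)$, so with the variance of $y$ fixed to one:

```latex
\begin{align*}
\operatorname{kurt}(y) = \operatorname{kurt}(\mathbf{z}^T \mathbf{s})
  &= \sum_i z_i^4 \operatorname{kurt}(s_i) = \kappa \sum_i z_i^4, \\
E\{y^2\} &= \sum_i z_i^2 = 1, \\
|\operatorname{kurt}(y)| = |\kappa| \sum_i z_i^4
  &\le |\kappa| \Big( \sum_i z_i^2 \Big)^2 = |\kappa|.
\end{align*}
```

Equality holds only when all cross terms $z_i^2 z_j^2$ ($i \neq j$) vanish, i.e. when exactly one $z_i$ is nonzero and equal to $\pm 1$. Under these assumptions, $y$ is least gaussian precisely when it equals $\pm s_i$ for some $i$.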
Therefore, we could take as $\mathbf{w}$ a vector that maximizes the nongaussianity of $\mathbf{w}^T \mathbf{x}$. Such a vector would necessarily correspond (in the transformed coordinate system) to a $\mathbf{z}$ which has only one nonzero component. This means that $\mathbf{w}^T \mathbf{x} = \mathbf{z}^T \mathbf{s}$ equals one of the independent components!
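The principle can be tried out on a toy problem. The sketch below makes several assumptions that are not in the text: two uniformly distributed sources with unit variance, a hypothetical mixing matrix that is a pure rotation (so the mixtures are already uncorrelated with unit variance), the absolute value of excess kurtosis as the nongaussianity measure, and a brute-force search over unit-norm weight vectors in place of a proper optimization algorithm.

```python
import math
import random


def excess_kurtosis(samples):
    """Sample excess kurtosis: E[(x - m)^4] / var^2 - 3 (zero for a gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    var = sum(c * c for c in centered) / n
    fourth = sum(c ** 4 for c in centered) / n
    return fourth / (var * var) - 3.0


def correlation(u, v):
    """Sample correlation coefficient between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = math.sqrt(sum((p - mu) ** 2 for p in u))
    sv = math.sqrt(sum((q - mv) ** 2 for q in v))
    return cov / (su * sv)


def project(w, a, b):
    """Linear combination y = w[0] * a + w[1] * b, sample by sample."""
    return [w[0] * p + w[1] * q for p, q in zip(a, b)]


rng = random.Random(1)  # fixed seed, for reproducibility only
n_samples = 5000
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance

# Independent sources with identical (assumed uniform) distributions.
s1 = [rng.uniform(-half_width, half_width) for _ in range(n_samples)]
s2 = [rng.uniform(-half_width, half_width) for _ in range(n_samples)]

# Hypothetical mixing matrix A: rotation by 30 degrees, x = A s.
phi = math.radians(30.0)
x1 = [math.cos(phi) * p - math.sin(phi) * q for p, q in zip(s1, s2)]
x2 = [math.sin(phi) * p + math.cos(phi) * q for p, q in zip(s1, s2)]


def score(step):
    """Nongaussianity of w^T x for the unit vector w at angle `step` degrees."""
    theta = math.radians(step)
    y = project((math.cos(theta), math.sin(theta)), x1, x2)
    return abs(excess_kurtosis(y))


# Brute-force search over unit vectors w (angles 0..179 degrees; w and -w
# give the same score, so half the circle is enough).
best_step = max(range(180), key=score)
theta = math.radians(best_step)
w = (math.cos(theta), math.sin(theta))
y_best = project(w, x1, x2)

# Transformed weights z = A^T w: one element should be close to zero.
z1 = math.cos(phi) * w[0] + math.sin(phi) * w[1]
z2 = -math.sin(phi) * w[0] + math.cos(phi) * w[1]
print("best angle (degrees):", best_step)
print("z = A^T w:", (z1, z2))
print("|corr(y, s1)|, |corr(y, s2)|:",
      abs(correlation(y_best, s1)), abs(correlation(y_best, s2)))
```

In this setup the maximizing vector corresponds to a $\mathbf{z}$ with one element near zero, and the recovered $y$ is strongly correlated with exactly one of the two sources. Which source is found, and with which sign, depends on the sample.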
Maximizing the nongaussianity of $\mathbf{w}^T \mathbf{x}$ thus gives us one of the independent components. In fact, the optimization landscape for nongaussianity in the $n$-dimensional space of vectors $\mathbf{w}$ has $2n$ local maxima, two for each independent component, corresponding to $s_i$ and $-s_i$ (recall that the independent components can be estimated only up to a multiplicative sign). To find several independent components, we need to find all these local maxima. This is not difficult, because the different independent components are uncorrelated: We can always constrain the search to the space that gives estimates uncorrelated with the previous ones. This corresponds to orthogonalization in a suitably transformed (i.e. whitened) space.
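The constrained search can be sketched for two components as follows. This is only an illustration under stated assumptions: uniformly distributed sources, a hypothetical non-orthogonal mixing matrix, whitening by the inverse square root of the sample covariance (using a closed form valid for the 2x2 case), the absolute value of excess kurtosis as the nongaussianity measure, and a brute-force search for the first vector. In two dimensions the orthogonality constraint leaves only one direction for the second vector, so no second search is needed.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def excess_kurtosis(v):
    """Sample excess kurtosis: E[(x - m)^4] / var^2 - 3 (zero for a gaussian)."""
    m = mean(v)
    centered = [x - m for x in v]
    var = mean([d * d for d in centered])
    return mean([d ** 4 for d in centered]) / (var * var) - 3.0


def correlation(u, v):
    """Sample correlation coefficient between two equally long sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = math.sqrt(sum((p - mu) ** 2 for p in u))
    sv = math.sqrt(sum((q - mv) ** 2 for q in v))
    return cov / (su * sv)


def project(w, a, b):
    """Linear combination y = w[0] * a + w[1] * b, sample by sample."""
    return [w[0] * p + w[1] * q for p, q in zip(a, b)]


rng = random.Random(2)  # fixed seed, for reproducibility only
n_samples = 5000
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance
s1 = [rng.uniform(-half_width, half_width) for _ in range(n_samples)]
s2 = [rng.uniform(-half_width, half_width) for _ in range(n_samples)]

# Hypothetical non-orthogonal mixing matrix A = [[1.0, 0.5], [0.3, 1.0]].
x1 = [1.0 * p + 0.5 * q for p, q in zip(s1, s2)]
x2 = [0.3 * p + 1.0 * q for p, q in zip(s1, s2)]

# Center the mixtures and compute their sample covariance C = [[a, b], [b, c]].
m1, m2 = mean(x1), mean(x2)
x1 = [p - m1 for p in x1]
x2 = [q - m2 for q in x2]
a = mean([p * p for p in x1])
b = mean([p * q for p, q in zip(x1, x2)])
c = mean([q * q for q in x2])

# Whitening matrix V = C^(-1/2). For a 2x2 symmetric positive definite C,
# C^(1/2) = (C + sqrt(det C) * I) / sqrt(trace C + 2 * sqrt(det C)).
root_det = math.sqrt(a * c - b * b)
t = math.sqrt(a + c + 2.0 * root_det)
v11 = (c + root_det) / (root_det * t)
v12 = -b / (root_det * t)
v22 = (a + root_det) / (root_det * t)
u1 = [v11 * p + v12 * q for p, q in zip(x1, x2)]  # whitened data, row 1
u2 = [v12 * p + v22 * q for p, q in zip(x1, x2)]  # whitened data, row 2


def score(step):
    """Nongaussianity of w^T u for the unit vector w at angle `step` degrees."""
    theta = math.radians(step)
    return abs(excess_kurtosis(project((math.cos(theta), math.sin(theta)), u1, u2)))


# First component: unit vector maximizing nongaussianity in the whitened space.
theta = math.radians(max(range(180), key=score))
w_first = (math.cos(theta), math.sin(theta))
# Second component: constrained to be orthogonal to the first vector.
w_second = (-math.sin(theta), math.cos(theta))

y1 = project(w_first, u1, u2)
y2 = project(w_second, u1, u2)
print("corr(y1, s1), corr(y1, s2):", correlation(y1, s1), correlation(y1, s2))
print("corr(y2, s1), corr(y2, s2):", correlation(y2, s1), correlation(y2, s2))
print("corr(y1, y2):", correlation(y1, y2))
```

Because the whitened data have an identity sample covariance, orthogonal weight vectors give exactly uncorrelated estimates, which is what makes the constraint easy to impose in the whitened space. With more than two components, each new vector would instead be searched for within the subspace orthogonal to all previously found vectors.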
Our approach here is rather heuristic, but it will be seen in the next section and Sec. 4.3 that it has a perfectly rigorous justification.