Intuitively speaking, the key to estimating the ICA model is nongaussianity. Actually, without nongaussianity the estimation is not possible at all, as mentioned in Sec. 3.3. This is probably also the main reason for the rather late resurgence of ICA research: in most of classical statistical theory, random variables are assumed to have gaussian distributions, which precludes any methods related to ICA.
The Central Limit Theorem, a classical result in probability theory, states that the distribution of a sum of independent random variables tends toward a gaussian distribution under certain conditions. Thus, a sum of two independent random variables usually has a distribution that is closer to gaussian than either of the two original random variables.
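To make this tendency concrete, the following small numerical sketch (not part of the derivation; the uniform distribution and sample size are arbitrary choices) compares the excess kurtosis of a single uniform variable with that of a normalized sum of two independent uniforms. The excess kurtosis is zero for a gaussian, and the sum lies closer to that value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def excess_kurtosis(y):
    # Sample excess kurtosis: E[(y - mean)^4] / var^2 - 3 (zero for a gaussian).
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2)**2 - 3

u1 = rng.uniform(-1, 1, n)        # one uniform random variable
u2 = rng.uniform(-1, 1, n)        # an independent copy
s = (u1 + u2) / np.sqrt(2)        # normalized sum of the two

print("single uniform :", excess_kurtosis(u1))  # about -1.2
print("sum of two     :", excess_kurtosis(s))   # about -0.6, closer to the gaussian value 0
```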
Let us now assume that the data vector $\mathbf{x}$ is distributed according to the ICA data model in Eq. 4, i.e. it is a mixture of independent components. For simplicity, let us assume in this section that all the independent components have identical distributions. To estimate one of the independent components, we consider a linear combination of the $x_i$ (see Eq. 6); let us denote this by $y = \mathbf{w}^T \mathbf{x} = \sum_i w_i x_i$, where $\mathbf{w}$ is a vector to be determined. If $\mathbf{w}$ were one of the rows of the inverse of $\mathbf{A}$, this linear combination would actually equal one of the independent components. The question is now: How could we use the Central Limit Theorem to determine $\mathbf{w}$ so that it would equal one of the rows of the inverse of $\mathbf{A}$? In practice, we cannot determine such a $\mathbf{w}$ exactly, because we have no knowledge of the matrix $\mathbf{A}$, but we can find an estimator that gives a good approximation.
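As a quick sanity check of the first observation (with a made-up $2 \times 2$ mixing matrix, chosen only for illustration), the sketch below verifies that if $\mathbf{w}$ is a row of the inverse of $\mathbf{A}$, then $\mathbf{w}^T \mathbf{x}$ reproduces one of the sources exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical independent (uniform, hence nongaussian) sources and example mixing matrix.
s = rng.uniform(-1, 1, size=(2, n))       # rows are s_1 and s_2
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])                # example mixing matrix, assumed known here
x = A @ s                                 # observed mixtures, x = A s

w = np.linalg.inv(A)[0]                   # first row of the inverse of A
y = w @ x                                 # linear combination y = w^T x

print(np.allclose(y, s[0]))               # True: y equals the first independent component
```

In practice, of course, $\mathbf{A}$ is unknown, which is exactly why the argument below is needed.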
To see how this leads to the basic principle of ICA estimation, let us make a change of variables, defining $\mathbf{z} = \mathbf{A}^T \mathbf{w}$. Then we have $y = \mathbf{w}^T \mathbf{x} = \mathbf{w}^T \mathbf{A} \mathbf{s} = \mathbf{z}^T \mathbf{s}$. Thus $y$ is a linear combination of the $s_i$, with weights given by the $z_i$. Since a sum of even two independent random variables is more gaussian than the original variables, $\mathbf{z}^T \mathbf{s}$ is more gaussian than any of the $s_i$ and becomes least gaussian when it in fact equals one of the $s_i$. In this case, obviously only one of the elements $z_i$ of $\mathbf{z}$ is nonzero. (Note that the $s_i$ were here assumed to have identical distributions.)
Therefore, we could take as $\mathbf{w}$ a vector that maximizes the nongaussianity of $\mathbf{w}^T \mathbf{x}$. Such a vector would necessarily correspond (in the transformed coordinate system) to a $\mathbf{z}$ which has only one nonzero component. This means that $\mathbf{w}^T \mathbf{x} = \mathbf{z}^T \mathbf{s}$ equals one of the independent components!
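The following sketch illustrates this argument directly in the $\mathbf{z}$ coordinates, using the absolute excess kurtosis as one simple (assumed) measure of nongaussianity: sweeping $\mathbf{z}$ over the unit circle in two dimensions, the combination $\mathbf{z}^T \mathbf{s}$ is least gaussian when $\mathbf{z}$ has a single nonzero entry, i.e. when $y$ equals one of the sources.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def excess_kurtosis(y):
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2)**2 - 3

# Two independent, identically distributed, nongaussian sources (unit-variance uniforms).
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))

# Sweep z over the unit circle and measure the nongaussianity of y = z^T s
# by the absolute excess kurtosis.
angles = np.linspace(0, np.pi / 2, 91)
nongauss = [abs(excess_kurtosis(np.array([np.cos(t), np.sin(t)]) @ s)) for t in angles]

best = angles[int(np.argmax(nongauss))]
print("least gaussian at angle (deg):", np.degrees(best))
# The maximum sits near 0 or 90 degrees, i.e. where z has only one nonzero component.
```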
Maximizing the nongaussianity of $\mathbf{w}^T \mathbf{x}$ thus gives us one of the independent components. In fact, the optimization landscape for nongaussianity in the $n$-dimensional space of vectors $\mathbf{w}$ has $2n$ local maxima, two for each independent component, corresponding to $s_i$ and $-s_i$ (recall that the independent components can be estimated only up to a multiplicative sign). To find several independent components, we need to find all these local maxima. This is not difficult, because the different independent components are uncorrelated: we can always constrain the search to the space that gives estimates uncorrelated with the previous ones. This corresponds to orthogonalization in a suitably transformed (i.e. whitened) space.
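A minimal sketch of this procedure, under assumptions that are ours rather than the text's (two uniform sources, a made-up mixing matrix, a crude grid search instead of the proper algorithms discussed later, and absolute kurtosis as the nongaussianity measure): the mixtures are first whitened, the first direction $\mathbf{w}_1$ is taken as the maximizer of nongaussianity, and the second direction is constrained to be orthogonal to it in the whitened space.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

def excess_kurtosis(y):
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2)**2 - 3

# Hypothetical setup: two independent uniform sources mixed by an example matrix A.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
x = A @ s

# Whiten the mixtures (zero mean, identity covariance).
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
V = E @ np.diag(d ** -0.5) @ E.T
xw = V @ x

# First component: grid search over unit-norm directions w, maximizing |kurtosis| of w^T xw.
angles = np.linspace(0, np.pi, 181)
kurts = [abs(excess_kurtosis(np.array([np.cos(t), np.sin(t)]) @ xw)) for t in angles]
t1 = angles[int(np.argmax(kurts))]
w1 = np.array([np.cos(t1), np.sin(t1)])

# Second component: in the whitened space the remaining direction is simply
# orthogonal to w1 (the orthogonalization constraint mentioned above).
w2 = np.array([-w1[1], w1[0]])

y1, y2 = w1 @ xw, w2 @ xw
# Each estimate should be strongly correlated with one source (up to sign and order).
print(np.round(np.corrcoef(np.vstack([y1, y2, s]))[:2, 2:], 2))
```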
Our approach here is rather heuristic, but it will be seen in the next section and Sec. 4.3 that it has a perfectly rigorous justification.