Independent Component Analysis, Blind Source Separation, and Projection Pursuit
Aapo Hyvärinen
Helsinki University of Technology
Laboratory of Computer and Information Science
P.O. Box 2200, FIN-02015 HUT, Finland
aapo.hyvarinen@hut.fi
http://www.cis.hut.fi/~aapo/

Slides presented at the EC Summer School on Bayesian Signal Processing, Cambridge, UK, 24 July 1998

Independent Component Analysis.

Observed (zero-mean) random vector ${\bf x}$ is modelled by a linear latent variable model [31,11]:

$\begin{displaymath}{\bf x}={\bf A}{\bf s}\end{displaymath}$ (1)

where

The 'mixing' matrix ${\bf A}$ is constant, usually square.
Latent variables $s_i$ are mutually independent and nongaussian.
Estimate both ${\bf A}$ and ${\bf s}$, observing only ${\bf x}$.
The $s_i$ are defined only up to a multiplicative constant.
The $s_i$ are not ordered.

Whitening (or decorrelation) of ${\bf x}$ is not enough to estimate the model [31]:

[Figures: Suni.eps, Xuni.eps, Vuni.eps]

Whitening gives ${\bf A}$ only up to an orthogonal transformation.
Whitening uses only the correlation matrix: $\approx n^2/2$ equations, but ${\bf A}$ has $n^2$ elements.
After whitening, ${\bf A}$ can be considered orthogonal.
Therefore, the model cannot be estimated for gaussian data!
Higher-order information enables estimation of the model [31,11].
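The whitening argument above can be checked numerically. The following is a minimal NumPy sketch, assuming two uniform sources and a hypothetical mixing matrix `A` chosen only for illustration (neither comes from the slides). It shows that whitened data stay white under any further rotation, which is why second-order statistics alone leave an orthogonal transformation undetermined.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000

# Two independent, zero-mean, unit-variance uniform sources (nongaussian).
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])            # hypothetical mixing matrix (assumption)
x = A @ s

def whiten(x):
    """Return whitened data z = V x and the symmetric whitening matrix V."""
    x = x - x.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)
    V = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return V @ x, V

z, V = whiten(x)
cov_z = z @ z.T / T                   # identity up to rounding error

# After whitening, the effective mixing matrix V A is approximately orthogonal.
M = V @ A
orth_error = np.abs(M @ M.T - np.eye(2)).max()

# Any further rotation U leaves the data white, so the covariance cannot
# distinguish z from U z.
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
zr = U @ z
cov_rot = zr @ zr.T / T

print("covariance after whitening:\n", np.round(cov_z, 3))
print("covariance after extra rotation:\n", np.round(cov_rot, 3))
print("max deviation of V A from orthogonality:", orth_error)
```

The symmetric form of the whitening matrix is one choice among many; any matrix of the form `U @ V` with `U` orthogonal whitens equally well.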

Basic intuitive principle.

(Sloppy version of) the Central Limit Theorem [8,14,13,20].

Consider a linear combination ${\bf w}^T{\bf x}$ .
$a_i s_i+a_j s_j$ is more gaussian than $s_i$.
Maximizing the nongaussianity of ${\bf w}^T{\bf x}$ , we can find s_i.
Also known as projection pursuit.
Problem: how to measure nongaussianity?
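As a small numerical illustration of this principle, the sketch below compares a single source with a linear combination of two sources, using excess kurtosis (one of the measures listed in the next section) as a stand-in for nongaussianity. The uniform sources and the weights `a_i`, `a_j` are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200000

def kurt(y):
    """Excess kurtosis of a sample after standardizing it."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

# Two independent unit-variance uniform sources (theoretical kurtosis -1.2).
s_i = rng.uniform(-np.sqrt(3), np.sqrt(3), T)
s_j = rng.uniform(-np.sqrt(3), np.sqrt(3), T)

a_i, a_j = 0.6, 0.8                   # hypothetical weights, a_i^2 + a_j^2 = 1
mix = a_i * s_i + a_j * s_j

k_source = kurt(s_i)
k_mix = kurt(mix)
print("kurtosis of s_i:            ", k_source)
print("kurtosis of a_i s_i+a_j s_j:", k_mix)
```

For independent sources the fourth-order cumulant is additive, so the kurtosis of the mixture is scaled by $a_i^4+a_j^4 < 1$, which moves it toward the gaussian value of zero.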

Measures of nongaussianity.

(Normalize $x$ to zero mean and unit variance.)

1. Absolute value of kurtosis (fourth-order cumulant) [14,20,13]

$\begin{displaymath}\vert\:\mbox{kurt}(x)\vert=\vert E\{x^4\}-3\vert.\end{displaymath}$

(2)

2. Differential entropy [11,14,20]:

$\begin{displaymath}H(x)=E \{-\log p(x)\}. \end{displaymath}$

(3)

3. Approximations of entropy [23]

$\begin{displaymath}J_G(x)=(E\{G(x)\}-E\{G(\nu)\})^2\end{displaymath}$

(4)

where $\nu$ is a standardized gaussian variable (zero mean, unit variance).

statistical properties not bad (for suitable G)
computationally simple
a good compromise?

4. Other measures (for the record):

cumulants of orders other than four (e.g. skewness) [14,20]
Fisher information $E\{ [(\log p)'(x)]^2 \}$ [20]
$L^2$ distances [18,12]
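Two of the measures above, the absolute kurtosis (2) and the entropy approximation (4), are easy to evaluate on samples. The sketch below is illustrative only: the choice $G(u)=\log\cosh u$, the grid used to evaluate $E\{G(\nu)\}$, and the three test distributions are assumptions, not prescriptions from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200000

def standardize(x):
    """Normalize a sample to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

def abs_kurtosis(x):
    """|kurt(x)| = |E{x^4} - 3| for standardized x, as in (2)."""
    x = standardize(x)
    return abs(np.mean(x ** 4) - 3.0)

def G(u):
    """Assumed nonquadratic function G(u) = log cosh(u)."""
    return np.log(np.cosh(u))

# E{G(nu)} for a standardized gaussian nu, by a Riemann sum on a fine grid.
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]
pdf = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
EG_NU = np.sum(G(grid) * pdf) * dx

def J_G(x):
    """J_G(x) = (E{G(x)} - E{G(nu)})^2 for standardized x, as in (4)."""
    x = standardize(x)
    return (np.mean(G(x)) - EG_NU) ** 2

samples = {
    "gaussian": rng.standard_normal(T),
    "laplace": rng.laplace(size=T),
    "uniform": rng.uniform(-1.0, 1.0, T),
}
results = {name: (abs_kurtosis(x), J_G(x)) for name, x in samples.items()}
for name, (k, j) in results.items():
    print(f"{name:9s} |kurt| = {k:.3f}   J_G = {j:.2e}")
```

Both measures are close to zero for the gaussian sample and clearly positive for the two nongaussian ones.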

Illustration.

For whitened data, maximize e.g. $\vert\:\mbox{kurt}({\bf w}^T{\bf x})\vert$ , with $\Vert{\bf w}\Vert=1$ :

[Figure circle.eps: Whitened data.]

[Figure kurt.eps: Modulus of kurtosis as a function of angle. Maxima are obtained in the directions of the independent components.]

To estimate several ICs, use the constraint of decorrelation [13,28].
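The illustration can be reproduced in two dimensions by brute force. The sketch below assumes two uniform sources and a hypothetical mixing matrix; it scans unit vectors ${\bf w}$ by angle, picks the direction of maximal $\vert\mbox{kurt}({\bf w}^T{\bf x})\vert$ on whitened data, and obtains the second component from the decorrelation constraint, which for whitened two-dimensional data is simply the orthogonal direction.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50000

s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))   # assumed sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                               # hypothetical mixing
x = A @ s

# Whiten the mixtures.
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / T)
z = E @ np.diag(d ** -0.5) @ E.T @ x

def kurt(y):
    # y has unit variance here because z is white and ||w|| = 1.
    return np.mean(y ** 4) - 3.0

# Scan directions w = (cos a, sin a) over half a circle.
angles = np.linspace(0.0, np.pi, 360, endpoint=False)
abs_kurt = np.array([abs(kurt(np.array([np.cos(a), np.sin(a)]) @ z))
                     for a in angles])

best = angles[np.argmax(abs_kurt)]
w1 = np.array([np.cos(best), np.sin(best)])
w2 = np.array([-np.sin(best), np.cos(best)])   # decorrelated with w1

y = np.vstack([w1 @ z, w2 @ z])
# Absolute correlations between estimates (rows) and true sources (columns).
corr = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(np.round(corr, 3))
```

Each row of `corr` should contain one entry near 1, reflecting that the sources are recovered only up to sign and order.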

Maximum likelihood estimation.

Log-likelihood of the model: ( ${\bf W}=\widehat{{\bf A}^{-1}}$ ) [43,7,8]

$\begin{displaymath}L=\sum_{t=1}^T \sum_{i=1}^n \log p_{s_i}({\bf w}_i^T {\bf x}(t))+T\log\vert\det {\bf W}\vert\end{displaymath}$

(5)

Equivalent to the infomax approach in neural networks [5,7,38].
Needs estimates of the $p_{s_i}$, but these need not be exact at all [9,8].
If the estimates are constrained to be white, essentially equivalent to the above one-by-one estimation (for differential entropy and its approximations).
MAP estimation: adding Jeffreys' prior [42], $-\log\vert\det{\bf W}\vert$, does not change the estimate much.
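The log-likelihood (5) can be evaluated directly once a model density is chosen. The sketch below assumes the density $p(u)=1/(\pi\cosh u)$ for every source, which is only a rough stand-in for the true (Laplace) density used to generate the data, in line with the remark that the density estimates need not be exact. The mixing matrix and the comparison matrix `W_rot` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 50000

# Supergaussian (Laplace) sources with unit variance; assumed for the demo.
s = rng.laplace(scale=1 / np.sqrt(2), size=(2, T))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])            # hypothetical mixing matrix
x = A @ s

def log_p(u):
    """Assumed model log-density: p(u) = 1 / (pi * cosh(u))."""
    return -np.log(np.pi) - np.log(np.cosh(u))

def log_likelihood(W, x):
    """Equation (5): sum of log p(w_i^T x(t)) over t and i, plus T log|det W|."""
    y = W @ x
    n_samples = x.shape[1]
    return np.sum(log_p(y)) + n_samples * np.log(abs(np.linalg.det(W)))

W_true = np.linalg.inv(A)

# A wrong candidate: the true unmixing followed by a 45-degree rotation,
# which re-mixes the sources without changing |det W|.
c = np.cos(np.pi / 4)
R = np.array([[c, -c],
              [c,  c]])
W_rot = R @ W_true

L_true = log_likelihood(W_true, x)
L_rot = log_likelihood(W_rot, x)
print("L at true unmixing matrix:", L_true)
print("L at rotated matrix:      ", L_rot)
```

Even with the mismatched density, the likelihood prefers the separating matrix over the rotated one in this example.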

Information-theoretic approach.

Mutual information of n random variables ${\bf y}=(y_1,...,y_n)^T$ [8,11]:

$\begin{displaymath}I(y_1,...,y_n)=\sum_{i=1}^n H(y_i)-H({\bf y})\end{displaymath}$

(6)

A measure of redundancy of ${\bf y}$. Equals zero iff the $y_i$ are independent.
For ${\bf y}={\bf W}{\bf x}$ , we obtain

$\begin{displaymath}I({\bf y})= \sum_{i=1}^n E\{- \log p_{y_i}(y_i)\}-\log\vert\det {\bf W}\vert+C\end{displaymath}$

(7)

Very similar to likelihood!
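The step from (6) to (7), and the link to the likelihood (5), can be written out as follows. This is a short derivation sketch assuming ${\bf W}$ is invertible, so that $p_{\bf y}({\bf y})=p_{\bf x}({\bf W}^{-1}{\bf y})/\vert\det{\bf W}\vert$; it identifies the constant as $C=-H({\bf x})$, which does not depend on ${\bf W}$.

```latex
\begin{align*}
H({\bf y}) &= E\{-\log p_{\bf y}({\bf y})\}
            = E\{-\log p_{\bf x}({\bf x})\} + \log\vert\det{\bf W}\vert
            = H({\bf x}) + \log\vert\det{\bf W}\vert, \\
I({\bf y}) &= \sum_{i=1}^n H(y_i) - H({\bf y})
            = \sum_{i=1}^n E\{-\log p_{y_i}(y_i)\}
              - \log\vert\det{\bf W}\vert - H({\bf x}), \\
-\frac{1}{T}\,L &\approx \sum_{i=1}^n E\{-\log p_{s_i}({\bf w}_i^T{\bf x})\}
              - \log\vert\det{\bf W}\vert .
\end{align*}
```

If the model densities $p_{s_i}$ coincide with the actual densities $p_{y_i}$, the last line equals $I({\bf y})+H({\bf x})$, so maximizing the likelihood and minimizing the mutual information differ only by a term that is constant in ${\bf W}$.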

Summary of ICA estimation principles.

Most approaches differ little.
All approaches can be interpreted as maximizing the nongaussianity of ICs.
Basic choice: the nonquadratic ('contrast') function in the nongaussianity measure:

kurtosis: fourth power
log of density
robust alternatives (in approx. of entropy) [21,22].

One-by-one estimation vs. estimation of the whole model.
Estimates constrained to be white vs. no constraint.

Algorithms (1). Adaptive gradient methods

Gradient methods for one-by-one estimation straightforward [13,27,29,22].
Stochastic gradient ascent for likelihood [5,43]:

$\begin{displaymath}\Delta{\bf W}\propto ({\bf W}^{-1})^T+g({\bf W}{\bf x}) {\bf x}^T\end{displaymath}$

(8)

where $g=(\log p_{s})'$.

Better: natural/relative gradient ascent of likelihood [1,2,9,8,10]:

$\begin{displaymath}\Delta{\bf W}\propto [{\bf I}+g({\bf y}) {\bf y}^T]{\bf W}\end{displaymath}$

(9)

where ${\bf y}=\hat{{\bf s}}={\bf W}{\bf x}$. Obtained by multiplying the gradient in (8) from the right by ${\bf W}^T{\bf W}$.
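A batch version of the natural gradient update (9) fits in a few lines. The sketch below is an illustration under stated assumptions: Laplace sources, a hypothetical mixing matrix, the nonlinearity $g(y)=-\tanh(y)$ (the score of the assumed density $p\propto 1/\cosh$), and a hand-picked step size and iteration count. The expectation is replaced by a sample average over the whole batch rather than a stochastic one-sample update.

```python
import numpy as np

rng = np.random.default_rng(5)
n, T = 2, 10000

s = rng.laplace(scale=1 / np.sqrt(2), size=(n, T))   # assumed sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                            # hypothetical mixing
x = A @ s

def g(y):
    """g = (log p_s)' for the assumed density p_s(y) proportional to 1/cosh(y)."""
    return -np.tanh(y)

W = np.eye(n)
lr = 0.1                                              # step size (assumption)
for _ in range(1000):
    y = W @ x
    # Natural gradient step: Delta W proportional to [I + E{g(y) y^T}] W.
    W = W + lr * (np.eye(n) + g(y) @ y.T / T) @ W

# If separation succeeded, W A is close to a scaled permutation matrix.
P = W @ A
dominance = np.abs(P).max(axis=1) / np.abs(P).sum(axis=1)
print("W A =\n", np.round(P, 3))
```

Note that the update never inverts ${\bf W}$, in contrast to the ordinary gradient (8).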

Algorithms (2). Fixed-point algorithm [16,21,28]

An approximate Newton method in block (batch) mode.
No matrix inversion, but still quadratic (or cubic) convergence.
No parameters to be tuned.
For a single IC (whitened data)

$\begin{displaymath}{\bf w}^+=E\{{\bf x}g({\bf w}^T{\bf x})\}-E\{g'({\bf w}^T{\bf x})\}{\bf w}, \mbox{normalize } {\bf w}\end{displaymath}$

(10)

For likelihood:

$\begin{displaymath}{\bf W}^+={\bf W}+{\bf D}_1[{\bf D}_2+E\{g({\bf y}) {\bf y}^T\}]{\bf W},\mbox{orthonormalize } {\bf W}\end{displaymath}$

(11)

The FastICA MATLAB package on the WWW [16].
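The one-unit iteration (10) can be sketched as below. This is not the FastICA package itself, only a minimal NumPy illustration assuming $g=\tanh$, whitened data, a hypothetical mixing matrix, and a convergence test based on the change in direction of ${\bf w}$; the function name `fixed_point_one_unit` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 20000

# One supergaussian and one subgaussian source (assumed for the demo).
s = np.vstack([rng.laplace(scale=1 / np.sqrt(2), size=T),
               rng.uniform(-np.sqrt(3), np.sqrt(3), size=T)])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])            # hypothetical mixing matrix
x = A @ s

# Whiten the mixtures.
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / T)
z = E @ np.diag(d ** -0.5) @ E.T @ x

def g(u):
    return np.tanh(u)

def g_prime(u):
    return 1.0 - np.tanh(u) ** 2

def fixed_point_one_unit(z, w0, max_iter=200, tol=1e-10):
    """Iterate w+ = E{z g(w^T z)} - E{g'(w^T z)} w, then normalize."""
    w = w0 / np.linalg.norm(w0)
    for _ in range(max_iter):
        u = w @ z
        w_new = (z * g(u)).mean(axis=1) - g_prime(u).mean() * w
        w_new /= np.linalg.norm(w_new)
        # Converged when the direction stops changing (sign flips allowed).
        converged = abs(abs(w_new @ w) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w

w = fixed_point_one_unit(z, rng.standard_normal(2))
y = w @ z
corr = [abs(np.corrcoef(y, s_i)[0, 1]) for s_i in s]
print("absolute correlation with each source:", np.round(corr, 3))
```

Which of the two sources is found depends on the random starting vector; further components would be obtained by adding a decorrelation step after each iteration.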

Relations to other methods.

Source separation by decorrelation [6,45]:

nongaussianity not used as additional information
time-delayed correlations used instead

in contrast to ICA, assumes that the data are time signals with spectral properties
Projection pursuit [20,17]

useful for visualization, exploratory data analysis
equivalent to one-by-one ICA estimation

Factor analysis [19,34]: ICA is a nongaussian (usually noise-free) version
Blind deconvolution [14,44]: obtained by constraining the mixing matrix
Principal component analysis [30]

often the same applications
very different statistical principles
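To make the contrast with decorrelation-based source separation concrete, the sketch below separates two gaussian signals, for which ICA would fail, using a single time-delayed correlation matrix. It is a simplified single-lag illustration and not necessarily the method of the cited references; the AR(1) sources, their coefficients, the lag, and the mixing matrix are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 20000

def ar1(coef, T, rng):
    """Gaussian AR(1) signal s[t] = coef * s[t-1] + e[t]."""
    e = rng.standard_normal(T)
    out = np.zeros(T)
    for t in range(1, T):
        out[t] = coef * out[t - 1] + e[t]
    return out

# Two gaussian sources with different spectra (lag-1 autocorrelations 0.9, -0.5).
s = np.vstack([ar1(0.9, T, rng), ar1(-0.5, T, rng)])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])            # hypothetical mixing matrix
x = A @ s

# Step 1: whiten.
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / T)
z = E @ np.diag(d ** -0.5) @ E.T @ x

# Step 2: diagonalize the symmetrized time-delayed correlation matrix.
lag = 1
C = z[:, lag:] @ z[:, :-lag].T / (T - lag)
C = (C + C.T) / 2
_, U = np.linalg.eigh(C)
y = U.T @ z

# Absolute correlations between estimates (rows) and true sources (columns).
corr = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(np.round(corr, 3))
```

The method works here only because the two sources have distinct lagged autocorrelations; it uses no higher-order information.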

Extensions of basic ICA model.

1. Noisy ICA: ${\bf x}={\bf A}{\bf s}+{\bf n}$

EM algorithm [37]: computationally complex
cumulant-based algorithms [35]: statistically problematic
bias reduction techniques [15,25]: perhaps most promising

2. More observations than independent components

simple solution (?): reduce dimension by PCA [31]

3. Fewer observations than independent components

ML algorithms still possible [41,36]
quite complicated
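For extension 2 (more observations than independent components), the PCA-based reduction can be sketched as follows. The example assumes a noise-free model with five observed mixtures of two Laplace sources and a hypothetical tall mixing matrix; PCA and whitening are combined by keeping only the leading eigenvectors of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, T = 2, 5, 20000                 # 2 sources, 5 observed mixtures

s = rng.laplace(scale=1 / np.sqrt(2), size=(n, T))   # assumed sources
A = np.array([[ 1.0,  0.5],
              [ 0.3,  1.0],
              [ 0.8, -0.4],
              [-0.2,  0.7],
              [ 0.5,  0.5]])          # hypothetical m x n mixing matrix
x = A @ s

# Eigen-decomposition of the covariance, sorted by decreasing eigenvalue.
x = x - x.mean(axis=1, keepdims=True)
eigval, eigvec = np.linalg.eigh(x @ x.T / T)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Keep the n leading principal components and whiten them in one step.
V = np.diag(eigval[:n] ** -0.5) @ eigvec[:, :n].T    # n x m
z = V @ x                                             # n x T, white
cov_z = z @ z.T / T
explained = eigval[:n].sum() / eigval.sum()

print("reduced data shape:", z.shape)
print("fraction of variance kept:", explained)
```

After this step the effective mixing matrix `V @ A` is square, so the basic ICA model with as many components as observations applies to `z`.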

Applications (1). Blind source separation

Four ICs ('source signals'):

[Figures: s1.eps, s2.eps, s3.eps, s4.eps]

Due to some external circumstances, only linear mixtures of the source signals are observed.

[Figures: y1.eps, y2.eps, y3.eps, y4.eps]

Problem: Estimate (separate) original signals from mixtures!
Applications: biomedical signals [46,47,48], telecommunications, audio noise cancelling (cocktail-party problem).

Applications (2). Feature extraction.

Barlow's theory of sparse coding, redundancy reduction [3].
ICA on image windows gives Gabor/wavelet-like filters:

nongaussianity of features useful e.g. for denoising [26].

Applications (3). Misc.

exploratory data analysis (cf. factor analysis, projection pursuit) [20,17]
visualization (like projection pursuit) [20,17]
regression [24]

Conclusions.

ICA is very simple as a model:

Estimation not so simple due to nongaussianity:

Estimation by maximizing nongaussianity of independent components.
Algorithms: adaptive (natural gradient ascent) vs. block/batch mode (fixed-point).
Classical application: blind source separation.
Promising applications in feature extraction.

Aapo Hyvarinen

8/27/1998