
Robust Higher Order Statistics

Max Welling
School of Information and Computer Science
University of California Irvine
Irvine CA 92697-3425 USA
welling@ics.uci.edu

Abstract

Sample estimates of moments and cumulants are known to be unstable in the presence of outliers. This problem is especially severe for higher order statistics, like kurtosis, which are used in algorithms for independent components analysis and projection pursuit. In this paper we propose robust generalizations of moments and cumulants that are less sensitive to outliers but at the same time retain many of their desirable properties. We show how they can be combined into series expansions to provide estimates of probability density functions. This in turn is directly relevant for the design of new robust algorithms for ICA. We study the improved statistical properties such as B-robustness, bias and variance, while in experiments we demonstrate their improved behavior.

1 INTRODUCTION
Moments and cumulants are widely used in scientific disciplines that deal with data, random variables or stochastic processes. They are well known tools that can be used to quantify certain statistical properties of a probability distribution, such as location (first moment) and scale (second moment). Their definition is given by

    \mu_n = E[x^n]    (1)

where E[\cdot] denotes the average over the probability distribution p(x). In practice we have a set of samples from the probability distribution and compute sample estimates of these moments. However, for higher order moments these estimates become increasingly dominated by outliers, by which we mean the samples that are far away from the mean. Especially for heavy tailed distributions this implies that these estimates have high variance and are generally unsuitable to measure properties of the distribution.

An undesirable property of moments is the fact that lower order moments can have a dominating influence on the value of higher order moments. For instance, when the mean is large it will have a dominating effect on the second order moment,

    E[x^2] = E[x]^2 + E\left[(x - E[x])^2\right]    (2)

The second term, which measures the variation around the mean, i.e. the variance, is a much more suitable statistic for scale than the second order moment. This process of subtracting lower order information can be continued to higher order statistics. The resulting estimators are called centralized moments or cumulants. Well known higher order cumulants are skewness (third order), measuring asymmetry, and kurtosis (fourth order), measuring the peakedness of the probability distribution. Explicit relations between cumulants and moments are given in appendix A (set \mu_0 = 1 for the classical case). Since cumulants are functions of moments up to the same order, they also suffer from high sensitivity to outliers.

Many statistical methods and techniques use moments and cumulants because of their convenient properties. For instance, they follow easy transformation rules under affine transformations. Examples in the machine learning literature are certain algorithms for independent components analysis [3, 2, 1]. A well known downside of these algorithms is their sensitivity to outliers in the data. Thus, there is a need to define robust cumulants which are relatively insensitive to outliers but retain most of the convenient properties that moments and cumulants enjoy. This is the topic of this paper.
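As a quick illustration of this sensitivity (our own sketch, not part of the paper), the snippet below contrasts sample estimates of the second and fourth moments on clean and mildly contaminated data; the data set, contamination level and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5000 samples from a unit-variance Laplace density (moderately heavy tails).
x = rng.laplace(scale=1.0 / np.sqrt(2.0), size=5000)

# Contaminate 0.2% of the samples with gross outliers.
x_bad = x.copy()
x_bad[:10] = 25.0

for name, data in [("clean", x), ("contaminated", x_bad)]:
    m2 = np.mean(data ** 2)            # second order moment
    m4 = np.mean(data ** 4)            # fourth order moment
    kurt = m4 / m2 ** 2 - 3.0          # excess kurtosis
    print(f"{name:13s}  m2 = {m2:6.3f}   m4 = {m4:10.3f}   kurtosis = {kurt:8.3f}")
```

Ten outliers among 5000 samples barely move the second moment but change the fourth moment and the kurtosis dramatically, which is exactly the failure mode the robust statistics proposed below are designed to avoid.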

2 MOMENTS AND CUMULANTS


A formal definition of the relation between moments and cumulants to all orders can be given in terms of the characteristic function (or moment generating function) of a probability distribution,

    \varphi(t) = E[e^{ixt}] = \sum_{n=0}^{\infty} \frac{1}{n!} \mu_n (it)^n    (3)

where the last expression follows by Taylor expanding the exponential. The cumulants can now be defined by

    \sum_{n=0}^{\infty} \frac{1}{n!} \kappa_n (it)^n = \ln \varphi(t)    (4)

where we expand the right hand side in powers of (it) and match terms at all orders.

The generalization of the above to the multivariate case is straightforward. Moments are defined as expectations of monomials,

    \mu_{i_1,\ldots,i_m} = E[x_{i_1} \cdots x_{i_m}]    (5)

and the cumulants are again defined through the characteristic function (see Eq. 7), where in addition to the univariate cumulants we now also have cross-cumulants.

From the definition of the cumulants in terms of the moments we can derive a number of interesting properties, which we state below. It will be our objective to conserve most of these properties when we define the robust cumulants.

Lemma 1 The following properties are true for cumulants:

I. For a Gaussian density, all cumulants higher than second order vanish.

II. For independent random variables, all cross-cumulants vanish.

III. All cumulants transform multi-linearly with respect to affine transformations.

IV. All cumulants higher than first order are invariant with respect to translations.

The proofs of these statements can for instance be found in [9] and are very similar to the proofs for the robust cumulants which we will present in the next section.

3 ROBUST MOMENTS AND CUMULANTS

In this section we define robust moments and cumulants by introducing an isotropic decay factor which down-weights outliers. With this decay factor we will have introduced a preferred location and scale. We therefore make the following important assumption: the probability density function has zero mean and unit variance (or covariance equal to the identity in the multivariate case). This can always be achieved by a linear transformation of the random variables. Analogously, data will need to be centered and sphered. One may worry that these preprocessing steps are non-robust operations. Fortunately, we can rely on an extensive body of literature [5][6] to compute robust estimates of location and scale.

As will become apparent in the following, a convenient choice for the robust moments is given by the following expression,

Definition 1 The robust moments are given by:

    \mu^{(\beta)}_{i_1 \ldots i_n} = E\left[ (\beta x_{i_1}) \cdots (\beta x_{i_n}) \, \phi_\beta(x)\, \phi^{-1}(x) \right]    (6)

where \phi(x) is the multivariate standard normal density. The decaying factor is thus given by \phi_\beta(x)/\phi(x) = \beta^d \exp\left(-\tfrac{1}{2}(\beta^2 - 1)\, x^T x\right), where d is the dimension of the space. In the limit \beta \rightarrow 1 we obtain the usual definition of moments.

In order to preserve most of the desirable properties that cumulants obey, we will use the same definition to relate moments to cumulants as in the classical case,

Definition 2 The robust cumulants are defined by:

    \sum_{n=0}^{\infty} \sum_{i_1=1}^{M} \cdots \sum_{i_n=1}^{M} \frac{1}{n!} \kappa^{(\beta)}_{i_1 \ldots i_n} (it_{i_1}) \cdots (it_{i_n}) = \ln\left( \sum_{m=0}^{\infty} \sum_{j_1=1}^{M} \cdots \sum_{j_m=1}^{M} \frac{1}{m!} \mu^{(\beta)}_{j_1 \ldots j_m} (it_{j_1}) \cdots (it_{j_m}) \right)    (7)

The right hand side can again be recognized as the logarithm of the moment generating function for robust moments,

    \varphi^{(\beta)}(t) = E\left[ \exp(i \beta x^T t) \, \phi_\beta(x)\, \phi^{-1}(x) \right]    (8)

The explicit relation between robust moments and cumulants up to fourth order is given in appendix A.

With the above definitions we can now state some important properties of the robust cumulants. Since we assume zero mean and unit variance, we cannot expect the cumulants to be invariant with respect to translations and scalings. However, we will prove that the following properties are still valid,

Theorem 1 The following properties are true for robust cumulants:

I. For a standard Gaussian density, all robust cumulants higher than second order vanish.

II. For independent random variables, all robust cross-cumulants vanish.

III. All robust cumulants transform multi-linearly with respect to rotations.

Proof: I: For a standard Gaussian we can compute the moment generating function analytically, giving \ln \varphi^{(\beta)}(t) = -\tfrac{1}{2} t^T t, implying that \kappa^{(\beta)}_{i_1 i_2} = \delta_{i_1 i_2} while all other cumulants vanish.

II: We note that if the variables \{x_i\} are independent, \varphi^{(\beta)}(t) factorizes into a product of expectations, which the logarithm turns into a sum, each term depending on only one t_i. Since cross-cumulants on the left hand side of Eq. 7 are precisely those terms which contain distinct t_i, they must be zero.

III: From Eq. 6 we see that, since the decay factor is isotropic, robust moments still transform multi-linearly with respect to rotations. If we rotate both the moments and t in the right-hand side of Eq. 7, it remains invariant. To ensure that the left-hand side of Eq. 7 also remains invariant, we infer that the robust cumulants must likewise transform multi-linearly with respect to rotations,

    \kappa^{(\beta)}_{i_1 \ldots i_n} \rightarrow O_{i_1 j_1} \cdots O_{i_n j_n} \, \kappa^{(\beta)}_{j_1 \ldots j_n}, \qquad O O^T = O^T O = I    (9)

This concludes the proof.
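As a minimal sketch (ours, with hypothetical function names) of how Definition 1 translates into a sample estimator: each centered and sphered sample is re-weighted by the decay factor \beta^d \exp(-\tfrac{1}{2}(\beta^2-1) x^T x) before the scaled monomial (\beta x_{i_1})\cdots(\beta x_{i_n}) is averaged; the value \beta = 1.3 is an arbitrary illustration.

```python
import numpy as np

def decay_factor(x, beta):
    """phi_beta(x)/phi(x) = beta^d * exp(-0.5*(beta^2 - 1)*||x||^2); rows of x are samples."""
    d = x.shape[1]
    return beta ** d * np.exp(-0.5 * (beta ** 2 - 1.0) * np.sum(x ** 2, axis=1))

def robust_moment(x, indices, beta):
    """Sample estimate of mu^(beta)_{i1...in} = E[(beta*x_{i1})...(beta*x_{in}) phi_beta/phi] (Eq. 6)."""
    w = decay_factor(x, beta)
    monomial = np.prod((beta * x)[:, list(indices)], axis=1)
    return np.mean(w * monomial)

rng = np.random.default_rng(1)
x = rng.standard_normal((5000, 2))     # pretend the data are already centered and sphered
beta = 1.3
print(robust_moment(x, (0, 0), beta))          # diagonal second order moment, close to 1 for Gaussian data
print(robust_moment(x, (0, 0, 0, 0), beta))    # diagonal fourth order moment
print(robust_moment(x, (0, 1), beta))          # cross moment, close to 0 for independent axes
```

Because the weight decays exponentially in ||x||^2, a single gross outlier contributes almost nothing to these averages, in contrast to the classical moments of the introduction.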

4 ROBUST GRAM-CHARLIER AND EDGEWORTH EXPANSIONS

Assuming we have computed robust cumulants (or equivalently robust moments) up to a given order, can we combine them to provide an estimate of the probability density function? For the classical case it has long been known that the Gram-Charlier and Edgeworth expansions are two possibilities [8]. In this section we show that these expansions can be generalized to the robust case as well. To keep things simple, we discuss the univariate case here. Multivariate generalizations are relatively straightforward.

Both robust Gram-Charlier and Edgeworth expansions will be defined as series expansions in the scaled Hermite polynomials H^\beta_n(x) \equiv H_n(\beta x),

    p(x) = \sum_{n=0}^{\infty} c^{(\beta)}_n H^\beta_n(x)\, \phi(x)    (10)

with

    c^{(\beta)}_n = \frac{1}{n!} \int p(x)\, H^\beta_n(x)\, \phi^{-1}(x)\, d\mu_\beta    (11)

where we have defined the measure d\mu_\beta = \phi_\beta(x)\, dx and used the following generalized orthogonality relation,

    \int H^\beta_n(x) H^\beta_m(x)\, d\mu_\beta = n!\, \delta_{nm}    (12)

We may also express the above series expansion directly in terms of the robust cumulants. The explicit expression is given by the following theorem,

Theorem 2 The series expansion of a density p(x) in terms of its robust cumulants is given by

    p(x) = \frac{\phi(x)}{\phi_\beta(x)} \exp\left( \sum_{n=0}^{\infty} \frac{1}{n!}\, \eta^{(\beta)}_n (-1)^n \frac{d^n}{d(\beta x)^n} \right) \phi_\beta(x)    (13)

with

    \eta^{(\beta)}_n = \kappa^{(\beta)}_n - \delta_{n,2}    (14)

Proof: see appendix B.

To find an explicit expression up to a certain order in the robust cumulants, one expands the exponential and uses (-1)^n \frac{d^n}{d(\beta x)^n} \phi_\beta(x) = H^\beta_n(x)\, \phi_\beta(x) to convert derivatives into Hermite polynomials. The equivalent result in the multivariate case is

    p(x) = \frac{\phi(x)}{\phi_\beta(x)} \exp\left( \sum_{n=0}^{\infty} \sum_{i_1=1}^{M} \cdots \sum_{i_n=1}^{M} \frac{1}{n!}\, \eta^{(\beta)}_{i_1 \ldots i_n} (-1)^n \frac{\partial^n}{\partial(\beta x)_{i_1} \cdots \partial(\beta x)_{i_n}} \right) \phi_\beta(x)

with \eta^{(\beta)}_{i_1 i_2} = \kappa^{(\beta)}_{i_1 i_2} - \delta_{i_1 i_2}.

Analogous to the classical literature we will speak of a Gram-Charlier expansion when we expand in the c^{(\beta)}_n and of an Edgeworth expansion when we expand in the \eta^{(\beta)}_n. Their only difference is therefore the convention by which the series is broken off after a finite number of terms.

When \beta = 1 the Hermite expansions discussed in this section are normalized, even when only a finite number of terms is taken into account. This holds since H_0 = 1 and c_0 = \frac{1}{N}\sum_A 1 = 1, while all higher order polynomials are orthogonal to 1. When generalizing to robust cumulants this no longer holds true. To correct this we add an extra term to the expansion,

    p_R(x) = \left\{ \sum_{n=0}^{R} c^{(\beta)}_n H^\beta_n(x) + \Delta(x) \right\} \phi(x),    (15)

The correction factor can be computed by a Gram-Schmidt procedure, resulting in

    \Delta(x) = \frac{1 - \sum_{n=0}^{R} n!\, a_n c^{(\beta)}_n}{\frac{1}{\beta\sqrt{2-\beta^2}} - \sum_{n=0}^{R} n!\, a_n^2} \left( \frac{\phi(x)}{\phi_\beta(x)} - \sum_{n=0}^{R} a_n H^\beta_n(x) \right)    (16)

with a_n = \frac{(n-1)!!}{n!}\, (\beta^2 - 1)^{n/2}\, \delta_{n,2k} for k \in \{0, 1, 2, 3, \ldots\}, where (n-1)!! denotes the double factorial of (n-1), defined by 1 \cdot 3 \cdot 5 \cdots (n-1). The correction factor is thus orthogonal to all Hermite polynomials H^\beta_n(x) with n = 1..R under the new measure d\mu_\beta. We can also show that p_R(x) always integrates to 1 and that when \beta \rightarrow 1 the correction term reduces to \phi(x)\, c_{R+K} H_{R+K}(x) with K = 1 when R is odd and K = 2 when R is even. Finally, we note that since \int \phi^2(x)/\phi_\beta(x)\, dx = 1/(\beta\sqrt{2-\beta^2}), the correction is only normalizable for \beta^2 < 2, which is what we will assume in the following.

When the c^{(\beta)}_n are estimated by averaging over samples (Eq. 25), we see that the decay factor \phi_\beta(x)/\phi(x) will again render them robust against outliers.

Figure 1: (a) Bias as a function of \beta^2 for a generalized Laplacian with a = 1.5 (super-Gaussian). (b) Asymptotic variance (solid line) and inverse Fisher information (dashed line) as a function of \beta^2 for a = 1.5. (c)-(d) Similar plots for a = 4 (sub-Gaussian).

5 CONSISTENCY, ROBUSTNESS, BIAS AND VARIANCE

In this section we examine the robustness, bias and efficiency of our generalized expansion. Many definitions in this section are taken from [5]. Our analysis will assume that the data arrive centered and sphered, which allows us to focus on the analysis of the higher order statistics. For a thorough study of the robustness properties of first and second order statistics see [5].

First we mention that the estimators c^{(\beta)}_n[p_R] for the truncated series expansion (Eq. 15) are Fisher consistent. This can be shown by replacing p(x) in Eq. 11 with p_R(x) and using orthogonality between \Delta(x) and the Hermite polynomials H^\beta_n(x), n = 1..R, w.r.t. the measure d\mu_\beta.

To prove B-robustness we need to define and calculate the influence function IF for the estimators c^{(\beta)}_n. Intuitively, the influence function measures the sensitivity of the estimators to adding one more observation at location x,

    IF(x) = \lim_{t \rightarrow 0} \frac{c^{(\beta)}_n[(1-t)\, p_R + t\, \delta_x] - c^{(\beta)}_n[p_R]}{t}.    (17)

An estimator is called B-robust if its influence function is finite everywhere. We now state the following result.

Theorem 3 The estimates c^{(\beta)}_n[p_R] are B-robust for \beta > 1.

Proof: It is straightforward to compute the influence function defined in Eq. 17,

    IF(x) = \frac{1}{n!} H^\beta_n(x)\, \frac{\phi_\beta(x)}{\phi(x)} - c^{(\beta)}_n    (18)

Since for \beta > 1 this IF is finite everywhere, the result follows.

Since the cumulants are simple functions of the c_n up to the same order, we conclude that the cumulants are also B-robust. It is important to notice that in the classical case (\beta = 1) the theorem does not hold, confirming that classical cumulants are not robust. Analogously one can show that the sensitivity to shifting data-points is also bounded for \beta > 1.
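The boundedness claimed in Theorem 3 is easy to probe numerically. The sketch below (our own, univariate) evaluates the influence function of Eq. 18 on a grid for the classical case \beta = 1 and for \beta = 1.3, with c_n set to zero for simplicity; the grid and the order n = 4 are arbitrary.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def influence(xgrid, n, beta, c_n=0.0):
    """IF(x) = (1/n!) H_n(beta*x) phi_beta(x)/phi(x) - c_n  (Eq. 18), univariate case."""
    ratio = beta * np.exp(-0.5 * (beta ** 2 - 1.0) * xgrid ** 2)   # phi_beta/phi for d = 1
    h_n = hermeval(beta * xgrid, [0.0] * n + [1.0])                # probabilists' Hermite H_n
    return h_n * ratio / factorial(n) - c_n

xs = np.linspace(-10.0, 10.0, 401)
for beta in (1.0, 1.3):
    print(beta, np.abs(influence(xs, n=4, beta=beta)).max())
# For beta = 1 the influence grows like x^4 / 4!; for beta > 1 the decay factor keeps it bounded.
```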
We now turn to the analysis of bias and variance. It is well known that the point-wise mean square error can be decomposed into a bias and a variance term,

    MSE_x\left(p^{(N)}_R(x)\right) = E\left[\left(p^{(N)}_R(x) - p(x)\right)^2\right] = E\left[\left(p^{(N)}_R(x) - p_R(x)\right)^2\right] + \left(p_R(x) - p(x)\right)^2    (19)

where p^{(N)}_R is the estimate of p_R using a sample of size N. The expectation E is taken over an infinite number of such samples. Clearly, the first term represents the variance and the second the bias, which is independent of N. The variance term V may be rewritten in terms of the influence function,

    V = \frac{1}{N} \sum_{n,m=0}^{R} \Sigma\left(c^{(\beta)}_n, c^{(\beta)}_m\right) H^\beta_n(x) H^\beta_m(x)\, \phi^2(x)    (20)

    \Sigma\left(c^{(\beta)}_n, c^{(\beta)}_m\right) = \int p(x)\, IF\left(x, c^{(\beta)}_n\right) IF\left(x, c^{(\beta)}_m\right) dx    (21)

So the variance decreases as 1/N with sample size, while the data independent part is completely determined by the asymptotic covariance matrix \Sigma, which is expressed in terms of the influence function.

Finally, by defining the Fisher information as

    J\left(c^{(\beta)}_n, c^{(\beta)}_m\right) = E\left[ \frac{1}{p(x)} \frac{\partial p_R(x)}{\partial c^{(\beta)}_n}\, \frac{1}{p(x)} \frac{\partial p_R(x)}{\partial c^{(\beta)}_m} \right] = \int \frac{H^\beta_n(x) H^\beta_m(x)\, \phi^2(x)}{p(x)}\, dx    (22)

the well known Cramer-Rao bound follows: \Sigma\left(c^{(\beta)}_n, c^{(\beta)}_m\right) \succeq J^{-1}\left(c^{(\beta)}_n, c^{(\beta)}_m\right).

In figure 1 we plot the bias and the total variation (trace of the covariance \Sigma) as a function of \beta^2 for a super-Gaussian and a sub-Gaussian density (the generalized Laplace density p \propto \exp(-b|x|^a) with unit variance and a = 1.5 and a = 4, respectively). The trace of the inverse Fisher information is also plotted (dashed line). The model included 10 orders in the expansion, n = 0, \ldots, 9, plus the normalization term \Delta(x). All quantities were computed using numerical integration. We conclude that both bias and efficiency improve when \beta moves away from the classical case \beta = 1.

Figure 2: Histogram of sound data (5000 samples).

6 INDEPENDENT COMPONENTS ANALYSIS

Although robust moments and cumulants can potentially find applications in a broad range of scientific disciplines, we will illustrate their usefulness by showing how they can be employed to improve algorithms for independent components analysis (ICA). The objective in ICA is to find a new basis for which the data distribution factorizes into a product of independent one-dimensional marginal distributions. To achieve this, one first removes first and second order statistics from the data by shifting the sample mean to the origin and sphering the sample covariance to be the identity matrix. These operations render the data de-correlated, but higher order dependencies may still remain. It can be shown [2] that if an independent basis exists, it must be a rotation away from the basis in which the data is de-correlated, i.e. x_{ica} = O x_{decor} where O is a rotation.

One approach to finding O is to propose a contrast function that, when maximized, returns a basis onto which the data distribution is a product of independent marginal distributions. Various contrast functions have been proposed, e.g. the neg-entropy [4] and the mutual information [1]. All contrast functions share the property that they depend on the marginal distributions, which need to be estimated from the data. Naturally, the Edgeworth expansion [4, 3] and the Gram-Charlier expansion [1] have been proposed for this purpose. This turns these contrast functions into functions of moments or cumulants. However, to obtain reliable estimates one needs to include cumulants of up to fourth order. It has been observed frequently that in the presence of outliers these cumulants often become unreliable (e.g. [7]).

We propose to use the robust Edgeworth and Gram-Charlier expansions discussed in this paper instead of the classical ones. As we will show in the experiments below, it is safe to include robust cumulants to very high order in these expansions (we have gone up to order 20), which at a moderate computational cost has a significant impact on the accuracy of our estimates of the marginal distributions. We note that the derivation of the contrast function in e.g. [4] crucially depends on properties I, II and III of theorem 1. This makes our robust cumulants ideal candidates to replace the classical ones. Instead of going through this derivation we will argue for a novel contrast function that represents a slight generalization of the one proposed in [4],

    I(O) = \sum_{n=1}^{R} \sum_{i=1}^{M} w_n \left( \tilde{\kappa}^{(\beta)}_{i \ldots i} \right)^2, \qquad w_n \geq 0,    (23)

where the \tilde{\kappa}^{(\beta)}_{i \ldots i} only differ from the usual \kappa^{(\beta)}_{i \ldots i} in second order, \tilde{\kappa}^{(\beta)}_{ii} = \kappa^{(\beta)}_{ii} - 1. These cumulants are defined on the rotated axes e'_i = O^T e_i.

We will now state a number of properties that show the validity of I(O) as a contrast function for ICA,

Theorem 4 The following properties are true for I(O):

i. I(O) is maximal if the probability distribution on the corresponding axes factors into an independent product of marginal distributions.

ii. I(O) is minimal (i.e. 0) if the marginal distributions on the corresponding axes are Gaussian.

Proof: To prove (i) we note that the following expression is a scalar (i.e. invariant) w.r.t. rotations^2,

    \sum_{n} \sum_{i_1 \ldots i_n} \left( \tilde{\kappa}^{(\beta)}_{i_1 \ldots i_n} \right)^2 = \text{constant}    (24)

We now note that this expression can be split into two terms: a sum over the diagonal terms, where i_1 = i_2 = \ldots = i_n, and a sum over all the remaining cross-cumulant terms. When all directions are independent, all cross-cumulants must vanish by property II of theorem 1. This minimizes the second term (since it is non-negative). Hence, by the fact that the sum of the two terms is constant, the first term, which equals I(O), must be maximal for independent directions.

To prove (ii) we invoke property I of theorem 1: for Gaussian random variables all cumulants \tilde{\kappa}^{(\beta)} must vanish.

By the above theorem we see that I(O) simultaneously searches for independent directions and non-Gaussian directions. Observe, however, that for practical reasons we have ignored cumulants of order higher than R. Hence, there will certainly be more than one distribution which

^2 For vectors this reduces to the statement that an inner product is a scalar. To prove the general case we use O^T O = I for every index separately.


Figure 3: (a) Expansion coefficients for the classical Gram-Charlier expansion (\beta = 1). (b) Density estimate for \beta = 1 after four orders. The negative tails signal the onset of a diverging series. (c) Decreasing expansion coefficients for \beta^2 = 1.8. (d) Density estimate after 10 orders for \beta^2 = 1.8.

Figure 4: Top row: generalized Laplace distributions with (a) a = 1, (b) a = 4. Bottom row: mixtures of two Gaussians with (c) \pi = 0.3, c = 3, d = 0 and (d) \pi = 0.5, c = 3, d = 2.

maximizes I(O) (for instance, distributions which only differ in statistics of order higher than R). Good objective functions are discriminative in the sense that only few (relevant) densities maximize them. We can influence the ability of I(O) to discriminate by changing the weighting factors w_n. Doing so allows for a more directed search towards predefined qualities, e.g. a search for high-kurtosis directions would imply a large w_4.

A straightforward strategy to maximize I(O) is gradient ascent, while at every iteration projecting the solution back onto the manifold of rotations (e.g. see [10]). A more efficient technique, which exploits the tensorial property of cumulants (i.e. property III of theorem 1), was proposed in [3]. This technique, called Jacobi optimization, iteratively solves two-dimensional sub-problems analytically.
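To make the contrast function concrete, here is a small two-dimensional sketch (our own, with hypothetical helper names and arbitrary settings): the rotation is parameterized by a single angle, the diagonal robust cumulants on the rotated axes are obtained from weighted sample moments via the relations of Appendix A, and the angle is found by a simple grid search rather than the Jacobi scheme of [3]. Only third and fourth order terms are used, i.e. w_3 = w_4 = 1 and all other w_n = 0.

```python
import numpy as np

def robust_diag_cumulants(y, w, beta):
    """Robust cumulants (orders 1-4) of one projection y, weight w = phi_beta(x)/phi(x).
    The moments use the scaled arguments beta*y (Definition 1) and the relations of Appendix A."""
    z = beta * y
    m0 = np.mean(w)
    m1, m2, m3, m4 = (np.mean(w * z ** n) / m0 for n in range(1, 5))
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    k4 = m4 - 3 * m2 ** 2 - 4 * m3 * m1 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k1, k2, k3, k4

def contrast(x, theta, beta, weights=(0.0, 0.0, 1.0, 1.0)):
    """I(O) of Eq. 23 for a 2-D rotation by angle theta (diagonal terms up to order 4)."""
    c, s = np.cos(theta), np.sin(theta)
    O = np.array([[c, -s], [s, c]])
    y = x @ O.T
    w = beta ** 2 * np.exp(-0.5 * (beta ** 2 - 1.0) * np.sum(x ** 2, axis=1))
    total = 0.0
    for i in range(2):
        k1, k2, k3, k4 = robust_diag_cumulants(y[:, i], w, beta)
        tilde = (k1, k2 - 1.0, k3, k4)      # tilde-kappa: subtract 1 at second order
        total += sum(wn * kn ** 2 for wn, kn in zip(weights, tilde))
    return total

def best_angle(x, beta=1.3, num=180):
    """Grid search for the rotation maximizing I(O); x must be centered and sphered (N x 2)."""
    thetas = np.linspace(0.0, np.pi / 2, num)
    return thetas[np.argmax([contrast(x, t, beta) for t in thetas])]
```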

7 EXPERIMENTS
The following set of experiments focuses on density estimates based on the Gram-Charlier expansion (Eq. 10), where we replace Eq. 11 with a sample estimate,

    c^{(\beta)}_n = \frac{1}{N}\, \frac{1}{n!} \sum_{A=1}^{N} \frac{\phi_\beta(x_A)}{\phi(x_A)}\, H^\beta_n(x_A)    (25)
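A sketch of this estimator (our own code, with hypothetical helper names, using the univariate expansion of Eq. 10 with the scaled Hermite polynomials H_n(\beta x) and \beta^2 = 1.8): the coefficients are computed with the weighted average of Eq. 25 and the truncated series is evaluated without the normalization correction \Delta(x).

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def phi(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def phi_beta(x, beta):
    return beta * phi(beta * x)        # univariate phi_beta, so phi_beta/phi is the decay factor

def hermite(n, x):
    """Probabilists' Hermite polynomial H_n(x)."""
    return hermeval(x, [0.0] * n + [1.0])

def robust_gc_coefficients(samples, beta, order):
    """c_n^(beta) = (1/(N n!)) sum_A [phi_beta(x_A)/phi(x_A)] H_n(beta*x_A)   (Eq. 25)."""
    w = phi_beta(samples, beta) / phi(samples)
    return np.array([np.mean(w * hermite(n, beta * samples)) / factorial(n)
                     for n in range(order + 1)])

def robust_gc_density(xgrid, coeffs, beta):
    """Truncated expansion p(x) ~ sum_n c_n H_n(beta*x) phi(x)   (Eq. 10, without Delta)."""
    series = sum(c * hermite(n, beta * xgrid) for n, c in enumerate(coeffs))
    return series * phi(xgrid)

rng = np.random.default_rng(2)
data = rng.laplace(scale=1.0 / np.sqrt(2.0), size=5000)   # unit-variance Laplace source (a = 1)
beta = np.sqrt(1.8)
coeffs = robust_gc_coefficients(data, beta, order=10)
xs = np.linspace(-6.0, 6.0, 241)
p_hat = robust_gc_density(xs, coeffs, beta)               # should stay non-negative and track p
```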

The reason we focus on this task is that we can demonstrate robustness by showing that low order robust statistics always dominate higher order robust statistics, even for heavy tailed distributions. Yet at the same time they carry the relevant information about the probability density, i.e. they combine into an accurate estimate of it. This exercise is also relevant for cumulant based algorithms for independent components analysis, because these rely on the fact that the Gram-Charlier or Edgeworth expansions describe the source distributions well.
Sound Data
We downloaded recordings from music CDs^3 and extracted 5000 samples from them. The histogram is shown in figure 2. Due to the presence of outliers we expect the classical expansion to break down. This can be observed in figure 3a, where the coefficients increase with the order of the expansion. In figure 3b we see that the density estimate has become negative in the tails after 4 orders, which is an indication that the series has become unstable. In figures 3c,d we see that for the robust expansion at \beta^2 = 1.8 the coefficients decrease with order and the estimate of the density is very accurate after 10 orders.
Synthetic Data
In this experiment we sampled 5000 data-points from two generalized Laplace densities p \propto \exp(-b|x|^a) (figures 4a,b) and from two mixtures of two Gaussians (figures 4c,d), parameterized as p_{mog}(x) = \pi\, a\, \phi(a x + b) + (1 - \pi)\, c\, \phi(c x + d), where \pi denotes the mixing proportion.

^3 http://sweat.cs.unm.edu/bap/demos.html

Figure 5: Top row: total L2 distance between true and estimated densities as a function of \beta^2 for the generalized Laplace density with (a) a = 1, (b) a = 4. Bottom row: same as top row for the mixture of Gaussians distributions with (c) \pi = 0.3, c = 3, d = 0 and (d) \pi = 0.5, c = 3, d = 2. The corresponding densities are shown in figure 4. The dashed line indicates the best estimate over all orders.

Figure 6: L2 distance as a function of the order of the expansion for (a) \beta^2 = 1 and (b) \beta^2 = 1.9 for the generalized Laplace PDF with a = 1.
These include super-Gaussian distributions (figures 4a,c), a sub-Gaussian density (figure 4b) and an asymmetric density (figure 4d). We plot the total L2 distance between the estimate and the true density as we vary \beta (figures 5a,b,c,d). Shown are the best estimate over all orders (dashed line) and the final estimate after 20 orders. In both cases it is observed that the best estimates are obtained around \beta^2 \approx 2 (but recall that \beta^2 < 2, see section 4). We also plot the L2 distance between true and estimated density as a function of the order of the expansion for \beta^2 = 1 and \beta^2 = 1.9 (a = 1) in figures 6a,b. Clearly, the robust expansion converges while the classical expansion is unstable. Finally, in figure 7 we compare the best estimated PDFs for the generalized Laplace density at a = 1 with \beta^2 = 1 (a) and \beta^2 = 1.9 (b).

The general conclusion from these experiments is that in all cases (super- or sub-Gaussian PDF, symmetric or asymmetric PDF) the quality (in L2-norm) of the estimated densities improves considerably when we use the robust series expansion with a setting of \beta^2 close to (but smaller than) 2. This effect is more pronounced for super-Gaussian densities than for sub-Gaussian densities.

8 DISCUSSION

In this paper we have proposed robust alternatives to higher order moments and cumulants. In order to arrive at robust cumulants, invariance w.r.t. translations was lost and the class of transformations under which they transform multi-linearly was reduced from affine to orthogonal (i.e. rotations). However, all other cumulant properties were conveniently preserved. We argue that by first centering and sphering the data (using robust techniques described in the literature [5]), multi-linearity w.r.t. orthogonal transformations is all we need, which could make the trade-off with improved robustness properties worthwhile.

There are two well-known limitations of cumulants that one needs to be aware of. Firstly, they are less useful as statistics characterizing the PDF if the mass is located far away from the mean. Secondly, the number of cumulants grows exponentially fast with the dimensionality of the problem. With these reservations in mind, many interesting problems remain, even in high dimensions, that are well described by cumulants of low dimensional marginal distributions, as the ICA example has illustrated.

The sensitivity to outliers can be tuned with the parameter \beta^2 \in [1, 2). Our experiments have shown that if one includes many orders in the expansion, optimal performance is obtained when \beta^2 is close to (but smaller than) 2. Although unmistakably some information is ignored by down-weighting the impact of outliers, the experiments indicated that the relevant information needed to estimate the PDF is mostly preserved. In future experiments we hope to show that this phenomenon is also reflected in improved performance of ICA algorithms based on robust cumulants.

A ROBUST MOMENTS AND CUMULANTS TO 4TH ORDER

This appendix contains the definition of the cumulants in terms of the moments and vice versa for general \beta. We have not denoted \beta explicitly in the following for notational convenience.

Figure 7: Best estimates for the generalized Laplace density at a = 1. In (a) we plot the best classical estimate (\beta^2 = 1), which is found after four orders of Hermite polynomials are taken into account (i.e. H_0(x), ..., H_4(x)). For higher orders the series becomes unstable and the calculation of the expansion coefficients is too sensitive to sample fluctuations. The best estimate from the robust expansion (\beta^2 = 1.9) is depicted in (b). In that case the best estimate is found when all orders are taken into account, i.e. 20.

    \kappa_0 = \ln \mu_0
    \kappa_1 = \frac{\mu_1}{\mu_0}
    \kappa_2 = \frac{\mu_2}{\mu_0} - \left(\frac{\mu_1}{\mu_0}\right)^2
    \kappa_3 = \frac{\mu_3}{\mu_0} - 3\,\frac{\mu_2}{\mu_0}\frac{\mu_1}{\mu_0} + 2\left(\frac{\mu_1}{\mu_0}\right)^3
    \kappa_4 = \frac{\mu_4}{\mu_0} - 3\left(\frac{\mu_2}{\mu_0}\right)^2 - 4\,\frac{\mu_3}{\mu_0}\frac{\mu_1}{\mu_0} + 12\,\frac{\mu_2}{\mu_0}\left(\frac{\mu_1}{\mu_0}\right)^2 - 6\left(\frac{\mu_1}{\mu_0}\right)^4

    \mu_0 = e^{\kappa_0}
    \frac{\mu_1}{\mu_0} = \kappa_1
    \frac{\mu_2}{\mu_0} = \kappa_2 + \kappa_1^2
    \frac{\mu_3}{\mu_0} = \kappa_3 + 3\kappa_2\kappa_1 + \kappa_1^3
    \frac{\mu_4}{\mu_0} = \kappa_4 + 4\kappa_3\kappa_1 + 3\kappa_2^2 + 6\kappa_2\kappa_1^2 + \kappa_1^4
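A quick numerical check of these relations (our own sketch, with made-up moment values): converting moments to cumulants and back should reproduce the input exactly.

```python
import numpy as np

def cumulants_from_moments(mu0, mu1, mu2, mu3, mu4):
    m1, m2, m3, m4 = mu1 / mu0, mu2 / mu0, mu3 / mu0, mu4 / mu0
    k0 = np.log(mu0)
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    k4 = m4 - 3 * m2 ** 2 - 4 * m3 * m1 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k0, k1, k2, k3, k4

def moments_from_cumulants(k0, k1, k2, k3, k4):
    mu0 = np.exp(k0)
    m1 = k1
    m2 = k2 + k1 ** 2
    m3 = k3 + 3 * k2 * k1 + k1 ** 3
    m4 = k4 + 4 * k3 * k1 + 3 * k2 ** 2 + 6 * k2 * k1 ** 2 + k1 ** 4
    return mu0, m1 * mu0, m2 * mu0, m3 * mu0, m4 * mu0

mus = (0.9, 0.1, 1.2, 0.3, 4.1)                    # made-up robust moments mu_0 ... mu_4
print(np.allclose(moments_from_cumulants(*cumulants_from_moments(*mus)), mus))   # True
```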

B PROOF OF THEOREM 2
The characteristic function or moment generating function of a PDF is defined by:

    \varphi(t) = \int e^{ixt} p(x)\, dx = \sum_{n=0}^{\infty} \frac{1}{n!}\, \mu_n (it)^n = \mathcal{F}[p(x)]    (26)

where the last expression follows from Taylor expanding the exponential and \mathcal{F} denotes the Fourier transform. For arbitrary \beta we have,

    \varphi^{(\beta)}(t) = \int e^{i\beta xt} p(x)\, \frac{\phi_\beta(x)}{\phi(x)}\, dx = \int \sum_{n=0}^{\infty} \frac{1}{n!} (\beta x)^n (it)^n\, p(x)\, \frac{\phi_\beta(x)}{\phi(x)}\, dx = \mathcal{F}\left[p(x)\, \frac{\phi_\beta(x)}{\phi(x)}\right](\beta t).    (27)

where in the last equality the definition of the generalized moments (Eq. 6) was used. \varphi^{(\beta)} is the (robust) moment generating function of p(x). We can find an expression for p(x) by inverting the Fourier transform,

    p(x) = \frac{\phi(x)}{\phi_\beta(x)}\, \frac{\beta}{2\pi} \int e^{-i\beta xt}\, \varphi^{(\beta)}(t)\, dt.    (28)

Next, we use the relation between the cumulants and the moments (Eq. 7) to write,

    p(x) = \frac{\phi(x)}{\phi_\beta(x)}\, \frac{\beta}{2\pi} \int e^{-i\beta xt}\, e^{\sum_{n=0}^{\infty} \frac{1}{n!} \kappa^{(\beta)}_n (it)^n}\, dt.    (29)

By defining \eta^{(\beta)}_n = \kappa^{(\beta)}_n - \delta_{n,2} we can separate a factor \hat{\phi}(t) = e^{-t^2/2} (a Gaussian) inside the integral,

    p(x) = \frac{\phi(x)}{\phi_\beta(x)}\, \frac{\beta}{2\pi} \int e^{-i\beta xt}\, e^{\sum_{n=0}^{\infty} \frac{1}{n!} \eta^{(\beta)}_n (it)^n}\, \hat{\phi}(t)\, dt.    (30)

Finally, we will need the result

    \frac{\beta}{2\pi} \int e^{-i\beta xt}\, (it)^n\, \hat{\phi}(t)\, dt = (-1)^n \frac{d^n}{d(\beta x)^n} \phi_\beta(x).    (31)

If we expand the exponential containing the cumulants in a Taylor series, apply this inverse Fourier transform to every term separately, and then recombine the terms into an exponential, we find the desired result (Eqs. 13, 14).
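The derivative-to-Hermite identity underlying the last step reduces, after substituting u = \beta x, to (-1)^n d^n\phi(u)/du^n = H_n(u)\phi(u); the short symbolic check below (our own illustration) confirms that identity for the first few orders.

```python
import sympy as sp

x = sp.symbols('x', real=True)
phi = sp.exp(-x ** 2 / 2) / sp.sqrt(2 * sp.pi)

# Probabilists' Hermite polynomials via the recursion He_{n+1} = x*He_n - n*He_{n-1}.
He = [sp.Integer(1), x]
for n in range(1, 6):
    He.append(sp.expand(x * He[n] - n * He[n - 1]))

# Verify (-1)^n d^n/dx^n phi(x) = He_n(x) phi(x) for n = 0..6.
for n in range(7):
    lhs = (-1) ** n * sp.diff(phi, x, n)
    print(n, sp.simplify(lhs - He[n] * phi) == 0)
```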

References
[1] S. Amari, A. Cichocki, and H.H. Yang. A new algorithm for blind signal separation. Advances in Neural Information Processing Systems, 8:757-763, 1996.
[2] A.J. Bell and T.J. Sejnowski. The independent components of natural scenes are edge filters. Vision Research, 37:3327-3338, 1997.
[3] J.F. Cardoso. High-order contrasts for independent component analysis. Neural Computation, 11:157-192, 1999.
[4] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.
[5] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel. Robust statistics. Wiley, 1986.
[6] P.J. Huber. Robust statistics. Wiley, 1981.
[7] A. Hyvarinen. New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems, volume 10, pages 273-279, 1998.
[8] M.G. Kendall and A. Stuart. The advanced theory of statistics, Vol. 1. Griffin, 1963.
[9] P. McCullagh. Tensor Methods in Statistics. Chapman and Hall, 1987.
[10] M. Welling and M. Weber. A constrained EM algorithm for independent component analysis. Neural Computation, 13:677-689, 2001.
