
Nonparametric Estimation for

Financial Investment under Log-Utility

A thesis accepted by the Fakultät Mathematik of the Universität Stuttgart
in fulfilment of the requirements for the degree of
Doktor der Naturwissenschaften (Dr. rer. nat.)

Presented by

Dominik Schäfer
from Pforzheim

Main referee: Prof. Dr. H. Walk

Co-referees: Prof. Dr. V. Claus
Prof. Dr. L. Györfi

Date of oral examination: July 15, 2002

Mathematisches Institut A der Universität Stuttgart

2002
dedicated to

My Parents
to whom I owe so much

Professor Paul Glendinning

without whom I might never have found my way
to mathematical finance and economics

CONTENTS

Abbreviations
Summary
Zusammenfassung
Acknowledgements

1 Introduction: investment and nonparametric statistics
  1.1 The market model
  1.2 Portfolios and investment strategies
  1.3 Pleading for logarithmic utility

2 Portfolio benchmarking: rates and dimensionality
  2.1 Rates of convergence in i.i.d. models
  2.2 Dimensionality in portfolio selection
  2.3 Examples

3 Predicted stock returns and portfolio selection
  3.1 A strategy using predicted log-returns
  3.2 Prediction of Gaussian log-returns
    3.2.1 An approximation result
    3.2.2 An estimation algorithm
  3.3 Proof of the approximation and estimation results
  3.4 Simulations and examples

4 A Markov model with transaction costs: probabilistic view
  4.1 Strategies in markets with transaction fees
  4.2 An optimal strategy
    4.2.1 Some comments on Markov control
    4.2.2 Proof of Theorem 4.2.1
  4.3 Further properties of the value function

5 A Markov model with transaction costs: statistical view
  5.1 The empirical Bellman equation
    5.1.1 An optimal strategy
    5.1.2 How to prove optimality
  5.2 Uniformly consistent regression estimation
  5.3 Proving the optimality of the strategy

6 Portfolio selection functions in stationary return processes
  6.1 Portfolio selection functions
  6.2 Estimation of log-optimal portfolio selection functions
  6.3 Checking the properties of the estimation algorithm
    6.3.1 Proof of the convergence Lemma 6.2.1
    6.3.2 Proof of the related Theorems 6.2.2 - 6.2.4
  6.4 Simulations and examples

L'Envoi

References

ABBREVIATIONS

|·| absolute value of a number, cardinality of a set
<·, ·> Euclidean scalar product
‖·‖∞ supremum norm
‖·‖q q-norm (on IR^d or Lq)
‖·‖ some other norm

IN positive integers 1, 2, 3, ...


IN0 nonnegative integers 0, 1, 2, 3, ...
IR real numbers
IR+ real numbers > 0
IR+0 real numbers ≥ 0

⌊x⌋ integer part of x
⌊x⌋N the smallest kN (k ∈ IN) such that kN ≥ x ≥ 0
⌈x⌉ x rounded toward infinity

·T transpose of a vector or matrix


spr(·) spectrum of a matrix

exp exponential to the base e


log logarithm to the base e
lb logarithm to the base 2

an = o(bn ) Landau symbol for: an /bn → 0


an = O(bn ) Landau symbol for: an /bn is a bounded sequence

AC complement of the set A


Ā closure of the set A
conv(A) convex hull of the set A

1A characteristic function of the set A


diam(A) diameter sup_{a,b∈A} ‖a − b‖∞ of the set A
ρ(x, A) distance inf_{a∈A} ‖x − a‖∞ from x to the set A
H(A, B) Hausdorff distance max{sup_{a∈A} ρ(a, B), sup_{b∈B} ρ(b, A)}
between the sets A and B
B(S) Borel σ-algebra on the topological space S

f(x)|x=y f evaluated at y
f + , f+ positive part of f , i.e., max{f, 0}
f − , f− negative part of f , i.e., max{−f, 0}
supp f support {x : f (x) > 0} of the function f
arg max f solution of a maximization problem (in some contexts
set-valued, i.e. {x : f (x) = supy f (y)}, in others a
measurably selected solution x with f (x) = supy f (y))

P probability measure
PX distribution of X
PY |X=x conditional distribution of Y given X = x
fX (·) a density of PX w.r.t. the Lebesgue measure
fY |X (·|x) a density of PY |X=x w.r.t. the Lebesgue measure
Q1 ≪ Q2 Q1 is absolutely continuous w.r.t. Q2
D(Q1‖Q2) Kullback-Leibler distance between Q1 and Q2
a.s. P-almost surely, with probability one
P-a.a. P-almost all

E mathematical expectation
E[Y |X] conditional expectation of Y given X
E[Y |X = x] conditional expectation of Y given X = x
Var variance
Cov covariance
N(µ; Σ) normal distribution with mean µ and variance-covariance matrix Σ

L1 (P) space of Lebesgue integrable functions w.r.t. P


Lq (P) qth order Lebesgue integrable functions w.r.t. P

const. a suitable constant


GSM geometrically strongly mixing
hot. higher order terms of an expansion
i.i.d. independent, identically distributed
p.a. per annum
w.r.t. with respect to
□ end of proof

All non-standard notation is explained when it occurs for the first time. The
random variables in this thesis are understood to be defined on a common
probability space (Ω, A, P). IR^d-valued random variables are implicitly assumed
to be measurable w.r.t. the Borel σ-algebra B(IR^d). If not stated otherwise,
measurability of functions f : IR^d → IR^{d′} means measurability w.r.t. B(IR^d)
and B(IR^{d′}).

SUMMARY

In this thesis we aim to plead for the application of nonparametric statistical


forecasting and regression estimation methods to financial investment problems.
In six chapters we explore applications of nonparametric techniques to portfolio
selection for financial investment. Clearly, this cannot be more than a crude
and somewhat arbitrary selection of topics within this vast area, so we decided
to concentrate on some typical situations. Our hope is to be able to illustrate
the benefits of nonparametric estimation methods in portfolio selection.

Chapter 1
Introduction: investment and nonparametric statistics
Investment is the strategic allocation of resources, typically of monetary re-
sources, in an environment, typically a market of assets, whose future evolution
is uncertain. Investment problems arise in a huge variety of contexts beyond the
financial one. Resources may also take the form of energy, of data-processing
resources, etc. Strategic investment planning helps to run many processes with
higher benefit. In this thesis we focus our attention on financial investment,
which we think is the “prototypical” example of a resource allocation process.
The three ingredients of financial investment are the market, the actions the
investor may take and his investment goal (discussed in detail in Sections 1.1-
1.3):

– As to the market: We assume that there are m assets in our financial market.
The ith asset yields a return Xi,n on an investment of 1 unit of money
during market period n (lasting from "time" n − 1 to n, time being
measured, e.g., in days of trading). The ensemble of returns on the nth
day of trading is given by

    Xn = (X1,n, ..., Xm,n)^T ∈ IR^m_+.

To the investor, the return process {Xn}_{n=1}^∞ appears to be a stochastic
process which, in many real markets, is stationary and ergodic (Definition
1.1.1). In some chapters we impose additional (but realistic) conditions
on the distribution of the process. The key point is, however, that

we use nonparametric models, i.e. models that do not assume a
parametric evolution equation, such as an ARMA, ARCH or GARCH
equation, to hold.

These models guarantee the highest flexibility in real applications.

– As to the investment actions: We are concerned with an investor who neither
consumes nor deposits new money into his portfolio. At the beginning of
each market period n, our investor uses all his current wealth to acquire
a portfolio bn of the stocks. It will be convenient to describe the portfolio
bn by the proportion bj,n of the investor's current wealth invested in asset
j (j = 1, ..., m) during market period n. Thus, bn is chosen at time n − 1
from the set S of all portfolios, consisting of the vectors (portfolios)

    bn = (b1,n, ..., bm,n)^T

satisfying bj,n ≥ 0 and ∑_{j=1}^m bj,n = 1. In some situations the set of
investment actions S may be further narrowed down by the occurrence of
transaction costs.

– As to the investment goal: If W0 is his initial wealth, an investor using
the portfolio strategy {bi}_{i=0}^{n−1} manages to accumulate the wealth
Wn = ∏_{i=1}^n <bi, Xi> · W0 during n market periods (<·, ·> is the
Euclidean scalar product). Naturally, the investor aims to maximize Wn.
It is known from the literature that there is no essential conflict between
short run (n finite) and long term (n → ∞) investment. In both cases
investment according to the conditional log-optimal portfolio

    b∗n := arg max_{b∈S} E[log <b, Xn> | Xn−1, ..., X1]

at time n is optimal, outperforming any other strategy because of

    E[Wn/W∗n] ≤ 1  and  lim sup_{n→∞} (1/n) log(Wn/W∗n) ≤ 0 with probability 1

(Cover and Thomas, 1991, Theorem 15.5.2). Here, W∗n is the wealth at
time n resulting from a series of conditionally log-optimal investments,
Wn the wealth from any other non-anticipating portfolio strategy. We
argue that

this is sufficient reason for the investor to use a logarithmic utility
function, i.e. to maximize the expected future logarithmic return given
the past return vectors.
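In code, the wealth dynamics and the log-utility criterion described above can be sketched as follows (a toy two-asset market invented for illustration; it is not an example from the thesis):

```python
import math

# Toy illustration (returns invented for this sketch, not taken from the thesis):
# wealth evolves as W_n = <b_1, X_1> * ... * <b_n, X_n> * W_0, and the log-utility
# criterion ranks portfolios by their empirical mean logarithmic return.

def wealth(returns, portfolio, w0=1.0):
    """Accumulated wealth when the same portfolio is held every market period."""
    w = w0
    for x in returns:
        w *= sum(bj * xj for bj, xj in zip(portfolio, x))
    return w

def avg_log_return(returns, portfolio):
    """Empirical mean logarithmic return (1/n) * sum_i log <b, X_i>."""
    n = len(returns)
    return sum(math.log(sum(bj * xj for bj, xj in zip(portfolio, x)))
               for x in returns) / n

# Two assets: a volatile stock and a riskless bond with return 1.0 per period.
X = [(2.0, 1.0), (0.5, 1.0), (2.0, 1.0), (0.5, 1.0)]
full_stock = (1.0, 0.0)  # oscillates: 2 * 0.5 * 2 * 0.5 = 1.0, no growth
half_half = (0.5, 0.5)   # rebalancing: (1.5 * 0.75)^2 = 1.265625, strict growth
print(wealth(X, full_stock), wealth(X, half_half))
```

The constantly rebalanced mixed portfolio grows although the stock alone goes nowhere, which is exactly the effect the logarithmic criterion rewards.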

The conditional log-optimal portfolio depends upon the distribution of the
return process {Xn}n. Realistically, the true distribution of the market returns,
and hence the log-optimal strategy, is not known to the investor. This makes
statistics the natural partner of investment. Statistics is needed to solve the
key problem:
to find a non-anticipating portfolio selection scheme {b̂n}n (working with
historical return data only, without knowing the true return distribution) such
that for any stationary ergodic return process {Xn}n, the investor's wealth
Ŵn := ∏_{i=1}^n <b̂i, Xi> grows, on average, as fast as with the log-optimum
strategy {b∗n}n. More formally, {b̂n}n should give

    lim sup_{n→∞} (1/n) log(W∗n/Ŵn) ≤ 0

with probability 1.
Such portfolio selection schemes are known to exist (Algoet, 1992). The
disadvantage is that they are fairly complicated and, even worse, they require
an enormous amount of past return data to yield practically relevant results.
It is the aim of this thesis to provide simplified, yet efficient portfolio selection
algorithms based on nonparametric forecasting and estimation techniques.
Particular emphasis is put on making the algorithms applicable to large
classes of markets.

Chapter 2
Portfolio benchmarking: rates and dimensionality
The performance of a portfolio selection rule is usually compared with that
of a benchmark portfolio selection rule. Our benchmark is the log-optimal
portfolio selection rule, and as we have seen in Chapter 1, this is the optimal
rule. An investor will typically find his own rule underperforming. He can only

hope that underperformance vanishes sufficiently fast when – with increasing


number of market periods – his estimates for the distribution of the return
process and hence his idea of the market become more and more complete.
Now, if the investor evaluates the historical returns X1, ..., Xn leading to the
portfolio choice b̂n+1 at time n, he will achieve a return R̂n = <b̂n+1, Xn+1> on
his investment during the next market period. This should be compared with
the return R∗n = <b∗n+1, Xn+1> of the conditional log-optimal portfolio.

From our log-utility point of view we suggest measuring the underperformance
of b̂n+1 in terms of the positivity of E log(R∗n/R̂n). The smaller this
expectation becomes, the better the selection rule b̂n+1.

Assuming that the return data arises from a process of independent and
identically distributed (i.i.d.) random variables, it is important to know at what
rate the underperformance E log(R∗n/R̂n) vanishes for typical portfolio selection
rules. Using notions from information theory we prove a lower bound on this
rate in Section 2.1. Even in the simplest of all markets, a market with only
finitely many possible return outcomes,

no empirical portfolio selection rule can make underperformance vanish
in every market faster than 1/n tends to 0, i.e. there is always a
market for which the inequality E log(R∗n/R̂n) ≥ const. · 1/n holds
(Theorem 2.1.1).

There are empirical portfolio selection rules that achieve this rate. In particular,
the empirical log-optimal portfolio

    b̂n+1 := arg max_{b∈S} (1/n) ∑_{i=1}^n log <b, Xi>        (0.0.1)

proves to be rate optimal insofar as

the empirical log-optimal portfolio selection rule (0.0.1) attains the
lower bound for the rate at which underperformance vanishes, whatever
the number of stocks in the market (Theorem 2.1.3).

Loosely speaking, it compensates for wrong investment decisions as fast as
possible. Interestingly enough, the findings are largely unaffected by the number
of stocks in the market, which is a rather untypical feature in nonparametric
estimation (Theorem 2.1.4 shows that this phenomenon persists in more
complicated market settings).
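The empirical rule (0.0.1) reduces, for two assets, to a one-dimensional concave maximization; the following sketch solves it by brute-force grid search (an implementation choice of this illustration, not of the thesis):

```python
import math

# Minimal sketch of the empirical log-optimal portfolio (0.0.1) for two assets.
# A crude grid search over b = (t, 1 - t) stands in for a proper concave
# optimizer; the two-asset restriction and the grid are choices of this sketch.

def empirical_log_optimal(returns, steps=1000):
    """Maximize (1/n) * sum_i log <b, X_i> over the one-dimensional simplex."""
    best_b, best_val = None, -math.inf
    for k in range(steps + 1):
        t = k / steps
        val = sum(math.log(t * x1 + (1.0 - t) * x2)
                  for x1, x2 in returns) / len(returns)
        if val > best_val:
            best_b, best_val = (t, 1.0 - t), val
    return best_b, best_val

# Volatile stock vs. riskless bond: the log-optimum is an interior mix.
X = [(2.0, 1.0), (0.5, 1.0)] * 50
b_hat, v = empirical_log_optimal(X)
print(b_hat)  # the maximizer is the balanced portfolio (0.5, 0.5)
```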
This is why we discuss the effects of “dimensionality” on the portfolio selection
process in more detail in Section 2.2. We argue that a reduction of the whole
stock market to some pre-selected stocks is inevitable, e.g., because of compu-
tational restrictions. In other words, the investor can only handle a smallish
subset of all stocks in the market for investment strategy planning. These stocks
have to be selected in the planning phase, even before investment starts. Hence,
criteria for the pre-selection of stocks from the market are needed. A common
way to do this is to pick the stocks whose chart promises high growth rates. It
will turn out, however, that this is fallacious:

any selection algorithm that assesses the single stocks separately, e.g.
on the basis of single-stock expected returns, is sure to pick the "bad"
stocks in some realistic market (Theorem 2.2.1).

This is a somewhat negative result, but it warns us that reasonable selection


schemes have to include further information about the market. We will show
that the variance-covariance structure of the stock returns provides sufficient
information in many markets (more precisely, in markets with log-normal re-
turns). Section 2.3 illustrates the results with simulations and examples, demon-
strating their practical relevance.
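A toy computation in the spirit of this warning (parameters invented; this illustrates the phenomenon, it is not Theorem 2.2.1): for a log-normal return X = exp(Y) with Y ~ N(µ, σ²), one has E[X] = exp(µ + σ²/2) but E[log X] = µ, so ranking stocks by expected return alone can favor the stock with the lower growth rate.

```python
import math

# Toy illustration (parameters assumed, not from the thesis): for a log-normal
# return X = exp(Y), Y ~ N(mu, sigma^2), the expected return is
# E[X] = exp(mu + sigma^2 / 2), while the long-run growth rate is E[log X] = mu.
# Ranking stocks by E[X] alone can therefore pick the slower-growing stock.

def expected_return(mu, sigma):
    return math.exp(mu + sigma ** 2 / 2)

def growth_rate(mu, sigma):
    return mu  # E[log X] for the log-normal model above

stock_a = (0.05, 0.10)  # modest mean log-return, low volatility
stock_b = (0.01, 0.60)  # lower growth rate, high volatility

# Stock B looks better by expected return ...
print(expected_return(*stock_b) > expected_return(*stock_a))
# ... but Stock A has the higher long-run growth rate.
print(growth_rate(*stock_a) > growth_rate(*stock_b))
```

This is precisely why the variance-covariance structure, and not the single-stock expected return, carries the relevant information in log-normal markets.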

Chapter 3
Predicted stock returns and portfolio selection
Having gained the insight that variance-covariance information about the
market (inter-stock correlations as well as temporal correlations) is integral to
successful investment decisions, we move on to particular investment strategies.
In Section 3.1 we consider a strategy which is particularly popular among
investors.
The strategy works in two steps, with the past logarithmic returns Yn, Yn−1 , ..., Y0
(Yi := log Xi ) as input data for the investment decision at time n:

1. Produce forecasts of the market future. It is established that forecasts
   should be based on conditional expectations of future log-returns given
   the observed past, i.e. on

       Ŷn+1 := E[Yn+1 | Yn, Yn−1, ...].

2. Invest in those stocks whose forecast Ŷn+1 promises to beat a riskless
   investment in a bond with return rate r, i.e. invest in a stock iff

       exp(Ŷn+1) ≥ r.

We will call this strategy a “greedy strategy”, because it tries to single out
the best possible stocks only. As we shall see, this provides us with a natural
strategy which can be applied in markets with low log-return variance (Section
3.1).
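The two steps of the greedy strategy can be sketched as follows; spreading wealth equally over the selected stocks is a simplifying assumption of this sketch, not a prescription of Section 3.1:

```python
import math

# Sketch of the two-step "greedy strategy": given forecasts Yhat of next-period
# log-returns, invest in the stocks whose forecast beats the riskless rate r,
# and fall back on the bond otherwise. Equal weighting among the selected
# stocks is an assumption of this sketch.

def greedy_portfolio(log_return_forecasts, r):
    """Return proportions over (stocks..., bond) based on exp(Yhat) >= r."""
    picked = [i for i, y in enumerate(log_return_forecasts) if math.exp(y) >= r]
    m = len(log_return_forecasts)
    if not picked:                 # no stock promises to beat the bond
        return [0.0] * m + [1.0]
    w = 1.0 / len(picked)
    return [w if i in picked else 0.0 for i in range(m)] + [0.0]

print(greedy_portfolio([0.03, -0.02, 0.10], r=1.01))  # stocks 1 and 3 selected
print(greedy_portfolio([-0.05, -0.02], r=1.01))       # everything into the bond
```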
The major problem in implementing the greedy strategy is the fact that the
forecasts Ŷn+1 can only be calculated if the distribution of the return process
is known to the investor. Hence, we need to derive an estimate Ê(Yn, ..., Y0)
for the conditional expectation Ŷn+1 = E[Yn+1 |Yn, Yn−1 , ...] from the market
observations Yn, ..., Y0. It is known from the literature that no such forecaster
can be strongly consistent in the sense of


    lim_{n→∞} ( Ê(Yn, ..., Y0) − E[Yn+1 | Yn, Yn−1, ...] ) = 0        (0.0.2)

with probability 1 for any stationary and ergodic process {Yn}n (Bailey, 1976).
This result is discouraging, but it does not rule out the existence of strongly
consistent forecasting rules for log-return processes as they arise in real financial
markets. In particular, Gaussian log-return processes have been proven to be
a good approximation for real log-return processes, but so far no answer has
been found to the question of whether there exist forecasters that are strongly
consistent in any stationary and ergodic Gaussian process. In Section 3.2 we
prove that the answer is indeed affirmative. Under weak extra conditions on
the Wold coefficients of the process

we present a forecaster Ê(Yn, ..., Y0) for stationary and ergodic Gaus-
sian processes which satisfies the strong consistency relation (0.0.2)
and which is remarkably easy to compute (Lemma 3.2.1 and Corollary
3.2.3).
13

This result provides us with the necessary tools to implement the greedy
strategy in Gaussian log-return processes. However, the algorithm is very much
of interest in its own right, as forecasting problems for Gaussian processes
arise in many areas.
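The forecaster of Lemma 3.2.1 itself is not reproduced here; as a rough stand-in (an assumption of this sketch, not the thesis's algorithm), a least-squares AR(1) predictor illustrates the kind of computation involved in forecasting a log-return from its own past:

```python
# Rough stand-in for a Gaussian-process forecaster (not the algorithm of
# Lemma 3.2.1): an AR(1) one-step predictor Yhat_{n+1} = a * Y_n with the
# coefficient a fit by least squares, minimizing sum_t (Y_t - a * Y_{t-1})^2.

def ar1_forecast(y):
    """Least-squares AR(1) one-step forecast from observations y_0, ..., y_n."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    a = num / den if den > 0 else 0.0
    return a * y[-1]

# An exactly AR(1) sample path y_t = 0.5 * y_{t-1} is predicted perfectly.
y = [1.0]
for _ in range(10):
    y.append(0.5 * y[-1])
print(ar1_forecast(y))  # equals 0.5 * y[-1]
```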
Section 3.3 proves the convergence properties of the algorithm. Application
examples with simulated and real data in Section 3.4 are promising, both when
the algorithm is run as a mere forecasting algorithm and when it is run as a
subroutine for the greedy strategy.
Chapter 4
A Markov model with transaction costs: probabilistic view
In simple markets where returns arise as i.i.d. data, the investor should invest
in a constant log-optimal portfolio strategy. This requires him not to change
the proportion of wealth held in each stock during the investment process. The
proportions remain constant; the prices of the assets, however, change relative
to each other during each market period, so that the actual quantities of the
single stocks in the portfolio vary from market period to market period. Thus,
a large number of transactions are needed to follow a constant log-optimal
strategy. In practice, this is a huge drawback: Much of the wealth accumulated
by a log-optimal strategy has to be spent to settle transaction costs such as
brokerage fees, administrative and telecommunication expenses. The conclusion
for the investor must be to adapt his strategy to meet two requirements: to
make as few costly transactions as possible, but to make as many as necessary
to boost his wealth. The aim of Chapters 4 and 5 is to investigate how these
two conflicting requirements can be balanced in one strategy.
To this end we shall assume that the returns arise from a d-stage Markov pro-
cess. In Chapter 4 the distribution of the return process is known, an unrealistic
assumption which we will drop in Chapter 5. Section 4.1 generalizes the mar-
ket model from Chapter 1 to include transaction costs proportional to the total
value of the purchased shares. Not surprisingly, the investor can only afford
a limited range of portfolio choices in the presence of transaction costs, and as we
shall see,
in d-stage Markovian return processes it suffices to consider strategies
based on portfolio selection functions, i.e. portfolio selection schemes
of the form bi = c(bi−1 , Xi−d , ..., Xi−1) with an appropriate function c
(Definition 4.1.2).
14

Hence, the next portfolio is a function of the last portfolio and the last d ob-
served return vectors. The investor aims to maximize his expected mean loga-
rithmic return as before by choosing an optimal selection function c.
In Section 4.2 we tackle the problem of how to obtain an optimal selection function
c, assuming for the moment that the distribution of the return process is known. The main result
demonstrates that

an optimal portfolio selection function c can be obtained from a solu-


tion of the Bellman equation (Theorem 4.2.1, equation 4.2.2).

The Bellman equation is known from the theory of dynamic programming,


but fundamental differences between classical dynamic programming and the
portfolio selection problem will become evident. Further properties of solutions
of the Bellman equation will be derived in Section 4.3, results that will be
needed for the arguments in Chapter 5.
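The flavor of the Bellman-equation approach can be conveyed by a heavily simplified toy model (i.i.d. returns, two assets, a coarse portfolio grid, a discount factor, and a cost charged on the rebalanced fraction are all assumptions of this sketch, not the setting of Theorem 4.2.1):

```python
import math

# Heavily simplified toy version of the Bellman-equation approach of Chapter 4.
# State: current stock proportion t. Action: new proportion s, costing the
# fraction COST * |s - t| of wealth. All modeling choices here are assumptions
# of this sketch.

RETURNS = [(2.0, 1.0), (0.5, 1.0)]   # equally likely (stock, bond) return pairs
GRID = [k / 10 for k in range(11)]   # discretized portfolio simplex
COST, BETA = 0.02, 0.9               # cost rate and discount factor

def reward(s, t):
    """Expected one-period log-growth after rebalancing from t to s."""
    trade = math.log(1.0 - COST * abs(s - t))
    exp_log = sum(math.log(s * x + (1 - s) * y) for x, y in RETURNS) / len(RETURNS)
    return trade + exp_log

def value_iteration(iters=200):
    v = {t: 0.0 for t in GRID}
    for _ in range(iters):
        v = {t: max(reward(s, t) + BETA * v[s] for s in GRID) for t in GRID}
    policy = {t: max(GRID, key=lambda s: reward(s, t) + BETA * v[s]) for t in GRID}
    return v, policy

v, policy = value_iteration()
print(policy[0.5])  # a portfolio already at the log-optimal mix is left untouched
```

The computed policy balances the two conflicting requirements named above: it trades only when the expected log-growth gain outweighs the transaction cost.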

Chapter 5
A Markov model with transaction costs: statistical view
The Bellman equation considered in Chapter 4 heavily depends upon the distri-
bution of the return process {Xn }n through a peculiar conditional expectation.
Hence, the results of Chapter 4 are valid only under the assumption that the in-
vestor knows the distribution of the stock return process. Of course, in practice
this is illusory. At best, the investor has an estimate of the return distribution
at his disposal. This, in turn, allows him to produce an estimate of the con-
ditional expectation in question and hence gives him an approximate Bellman
equation involving the observed empirical return data. Using nonparametric
regression estimation techniques

we will show in Section 5.1 how a natural empirical counterpart of the


Bellman equation from Chapter 4 can be found (equation 5.1.2).

With similar techniques as in Chapter 4 we will establish that this empirical


equation can be solved under realistic conditions.

This will lead us to a strategy that merely relies on observational data


but has the same optimality properties as the (theoretical) optimal portfolio
selection rule in the presence of transaction costs (Theorems 5.1.1
and 5.1.2).

For this, we will fall back on generalizations of existing uniform consistency


results in regression estimation, which will be provided in Section 5.2. In par-
ticular, if {Xn }n is a stationary geometrically strongly mixing process and g is
taken from a class G of Lipschitz continuous functions we estimate the condi-
tional expectation

R(g, b, x) := E[g(X1, b)|X0 = x] (b ∈ S)

by a kernel regression estimator Rn (g, b, x). Depending on the smoothness of a


density of X0 (which we assume to exist) we determine the rate of convergence
of
    sup_{g∈G} E sup_{x∈X, b∈S} |Rn(g, b, x) − R(g, b, x)| → 0   (n → ∞),

i.e. of the expected uniform estimation error, uniformly in G (Corollary 5.2.2).


This result is of interest in other areas of nonparametric statistics as well.
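A minimal sketch of the kernel regression idea used here, in the form of a Nadaraya-Watson estimator of a conditional expectation (the Gaussian kernel and the fixed bandwidth are illustrative choices, not those of Section 5.2):

```python
import math

# Minimal Nadaraya-Watson kernel regression estimate of E[Y | X = x], the kind
# of estimator behind R_n(g, b, x) in Section 5.2. Kernel and bandwidth below
# are illustrative choices of this sketch.

def nw_estimate(xs, ys, x, bandwidth=0.3):
    """Weighted average of the Y_i with Gaussian kernel weights K((x - X_i)/h)."""
    weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Noise-free data from Y = X^2: the estimate tracks the regression function.
xs = [i / 20 for i in range(21)]
ys = [x * x for x in xs]
print(nw_estimate(xs, ys, 0.5, bandwidth=0.05))  # close to 0.25
```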
Finally, Section 5.3 is devoted to the proof of optimality and combines the
results from Chapter 4 with uniformly consistent regression estimation techniques.

Chapter 6
Portfolio selection functions in stationary return processes
Since the investor may in some cases have reason to believe that the historical
return data does not follow a d-stage Markov process, we move on to even
more general market models than in the previous
chapters. Ignoring transaction costs, we consider a market whose returns are
merely stationary and ergodic. It is natural for the investor to take his invest-
ment decisions on the basis of recently observed returns, say on the basis of the
returns during the last d ∈ IN market periods (d fixed). This leads us to the
notion of log-optimal portfolio selection functions.
We make this more concrete in Section 6.1, where we take our familiar log-
utility approach again. The investor tries to find a log-optimal portfolio selection
function, i.e. a measurable function

    b∗ : IR^{dm}_+ −→ S

such that (<·, ·> denoting the Euclidean scalar product)

    E(log <b∗(X0, ..., Xd−1), Xd>) ≥ E(log <f(X0, ..., Xd−1), Xd>)

for all measurable f : IR^{dm}_+ −→ S. For the (n + 1)st day of trading, b∗ advises
the investor to acquire the portfolio b∗(Xn−d+1, ..., Xn).
Clearly, the concept of log-optimal portfolio selection functions does not reach
the same degree of generality as the concept of a conditional log-optimal port-
folio (where d is such that the whole observed past is included in the portfolio
decision). In spite of being a simplification, this approach nevertheless gives us
several advantages over the log-optimal strategy as far as computation, estima-
tion and interpretation are concerned.
With log-optimal portfolio selection functions we face the same problem as with
log-optimal portfolios. Both can only be calculated if the true distribution of the
return process happens to be known. A practitioner, however, needs to have an
estimation procedure that evaluates observed past return data to approximate
the true log-optimal device.

In Section 6.2 we therefore develop an algorithm to produce estimates


b̂n of a log-optimal portfolio selection function b∗ from past return data.

We require very mild conditions beyond stationarity and ergodicity. More
precisely, we assume that the return process {Xn}_{n=0}^∞ is an [a, b]^m-valued
stationary and ergodic process (0 < a ≤ b < ∞ need not be known) and that a
Lipschitz condition on the conditional return ratio E[Xd / <s, Xd> | Xd−1 =
xd−1, ..., X0 = x0] holds. The Lipschitz constant L is taken as a known market
constant.
Using a stochastic gradient algorithm and combining it with nonparametric
regression estimators,

we establish the strong convergence of the estimates b̂n to the true


log-optimal portfolio selection function b∗ , avoiding the usual mixing
conditions (Theorem 6.2.2).

What is even more important in practical applications:

Selecting portfolios on the basis of the estimated log-optimal portfolio
selection functions yields optimal growth of wealth among all strategies
that take their investment decisions on the basis of the last d observations.

Indeed, let Ŝn be the wealth accumulated during n market periods when on
the (i + 1)st day of trading the portfolio b̂i (Xi−d+1 , ..., Xi) is selected using the
most recent estimate b̂i of a log-optimal portfolio selection function. Then, if
Sn is the wealth accumulated during the same period using any other portfolio
selection function of the last d observed return vectors,
    lim sup_{n→∞} (1/n) log(Sn/Ŝn) ≤ 0
with probability 1 (Corollary 6.2.3).
After an appropriate modification, the algorithms and the results remain valid
even if the market constant L is unknown in real market applications (Theorem
6.2.4). Section 6.3 proves the findings, and the chapter is rounded off with
several realistic examples in Section 6.4.
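The stochastic-gradient idea behind the estimation algorithm can be sketched in its simplest form, for d = 0, i.e. estimating a single log-optimal portfolio instead of a selection function (a simplification made for this sketch, not in Theorem 6.2.2); the ascent direction uses the gradient of log <b, X> in b, which is X / <b, X>:

```python
# Sketch of projected stochastic gradient ascent toward a log-optimal portfolio,
# in the simplest case d = 0 (an assumption of this sketch). Each step moves b
# along the gradient X / <b, X> of log <b, X> and projects back onto the simplex.

def project_to_simplex(v):
    """Euclidean projection of v onto {b : b_j >= 0, sum_j b_j = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def sga_portfolio(returns, b0, eta=0.05, sweeps=200):
    b = list(b0)
    for _ in range(sweeps):
        for x in returns:
            s = sum(bj * xj for bj, xj in zip(b, x))
            b = project_to_simplex([bj + eta * xj / s for bj, xj in zip(b, x)])
    return b

# Volatile stock vs. riskless bond, starting from a deliberately bad portfolio.
X = [(2.0, 1.0), (0.5, 1.0)] * 10
b = sga_portfolio(X, b0=[0.9, 0.1])
print(b)  # oscillates in a small neighborhood of the log-optimum (0.5, 0.5)
```

With a decreasing step size in place of the constant eta, the iterates would converge rather than hover near the optimum; the constant step keeps the sketch short.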
Chapters 2, 3 and 6 can be read independently of each other; they are self-
contained. Chapters 4 and 5, however, are closely linked. Notation that goes
beyond common mathematical style is explained where it occurs for the first
time. We also refer the reader to the list of abbreviations at the beginning
of the thesis. The calculations and plots for the examples were generated us-
ing Matlab 4.0 and 6.0.0.88, Minitab 11.2 and R 1.1.1 with historical stock
quotes (daily closing prices) from the New York Stock Exchange provided by
www.wallstreetcity.com.

ZUSAMMENFASSUNG

Diese Arbeit soll ein Plädoyer sein für die Anwendung nichtparametrischer
statistischer Vorhersage- und Schätzmethoden auf Probleme, wie sie bei der
Planung von Finanzanlagen und Investitionen auftreten.
In sechs Kapiteln werden verschiedene Anwendungsmöglichkeiten nichtparamet-
rischer Techniken bei der Portfolioauswahl an Finanzmärkten analysiert. Dies
kann natürlich nur einen groben und zugegebenermaßen willkürlichen Aus-
schnitt aus diesem weiten Gebiet widerspiegeln –wir hoffen jedoch, dadurch die
Vorzüge nichtparametrischer Schätzmethoden bei der Portfolioauswahl aufzeigen
zu können.

Kapitel 1
Einführung: Investment und nichtparametrische Statistik
Investment ist der strategisch geplante Einsatz von Ressourcen (üblicherweise
von finanziellen Ressourcen) in einer Umgebung (üblicherweise in einem Fi-
nanzmarkt), deren zukünftige Entwicklung zufälligen Fluktuationen unterliegt.
Investitionsprobleme treten in einer Vielzahl von Gebieten auch über den fi-
nanziellen Kontext hinaus auf. Dabei können Ressourcen u. A. die Form
von Energie, von Datenverarbeitungskapazitäten, etc. annehmen. Die strate-
gische Planung von Investitionen hilft, viele Prozesse mit höherem Nutzen zu
betreiben. Diese Arbeit konzentriert sich auf finanzielle Investitionen, welche
gleichsam den “Prototyp” für verschiedenste Prozesse bilden, bei denen System-
ressourcen gewinnbringend einzusetzen sind.
Bei Investitionen finanzieller Natur spielen drei Komponenten eine Rolle: der
Markt, die Handlungsmöglichkeiten des Investors und sein Investitionsziel. Diese
Bausteine werden in den Abschnitten 1.1-1.3 im Detail diskutiert.

– Zum Markt: Wir gehen von einem Finanzmarkt mit m Anlagemöglichkeiten


(Aktien, festverzinsliche Wertpapiere, ...) aus. Die i. Anlagemöglichkeit
erzielt in der Marktperiode n eine Rendite Xi,n auf eine Investition von
20

einer Geldeinheit. Die n. Marktperiode dauere vom “Zeitpunkt” n − 1 bis


zum Zeitpunkt n, wobei die Zeit z.B. in Handelstagen gemessen wird. Die
Renditen der einzelnen Anlagemöglichkeiten am n. Handelstag werden im
Renditevektor
Xn = (X1,n , ..., Xm,n)T ∈ IRm
+

zusammengefasst. In den Augen des Investors ist {Xn }∞ n=1 ein stochasti-
scher Prozess, welcher in vielen realen Märkten stationär und ergodisch
ist (Definition 1.1.1). In manchen Kapiteln dieser Arbeit werden (re-
alistische) Zusatzannahmen über die Verteilung des Prozesses getroffen.
Entscheidend ist dabei jedoch,

dass wir nichtparametrische Modelle betrachten –Modelle also, die


nicht von der Existenz einer parametrischen Entwicklungsgleichung
ausgehen, wie sie z.B. ARMA-, ARCH- und GARCH-Prozesse be-
sitzen.

Diese Modelle garantieren höchste Flexibilität bei der Anwendung in realen


Finanzmärkten.

– Zu den Handlungsmöglichkeiten: Wir betrachten einen Investor, der weder


Teile seines Vermögens auf persönlichen Konsum verwendet, noch seinem
Portfolio im Verlauf des Investitionsprozesses neues Geld zufließen lässt.
Am Beginn jeder Marktperiode n verwendet der Investor sein gesamtes
Vermögen darauf, ein Aktienportfolio bn zu erwerben. Ein solches Portfolio
bn wird durch die Anteile bj,n am aktuellen Gesamtvermögen des Investors
beschrieben, welche in der n. Marktperiode in die Anlegemöglichkeit j =
1, ..., m investiert werden. Die Wahl von bn erfolgt dann aus der Menge S
aller Portfolios, welche aus den Vektoren (Portfolios)

bn = (b1,n, ..., bm,n)T

besteht, für die bj,n ≥ 0 und m


P
j=1 bj,n = 1. In manchen Situationen wird
S weiter durch das Auftreten von Transaktionskosten eingeschränkt.

– On the investment goal: Let W_0 be the investor's initial wealth. If he uses the portfolio strategy {b_i}_{i=0}^{n−1}, after n market periods he commands the wealth W_n = ∏_{i=1}^n <b_i, X_i> · W_0 (<·,·> denotes the Euclidean scalar product). The investor's goal is to achieve as large a value of W_n as possible. It is known from the literature that there is no fundamental conflict between near (n finite) and distant (n → ∞) investment horizons. In both cases it is optimal to invest at time n according to the conditionally log-optimal portfolio

b*_n := arg max_{b∈S} E[log <b, X_n> | X_{n−1}, ..., X_1].

It outperforms every other strategy in the sense that

E[W_n / W*_n] ≤ 1  and  lim sup_{n→∞} (1/n) log(W_n / W*_n) ≤ 0 with probability 1

(Cover and Thomas, 1991, Theorem 15.5.2). Here W*_n denotes the wealth at time n that the investor achieves by a series of conditionally log-optimal investments, and W_n the wealth under an arbitrary other portfolio strategy that commands no more information than can be derived from past market observations (a so-called "causal" strategy).

This should be reason enough for the investor to use a logarithmic utility function, i.e., given the return vectors observed in the past, to maximise the expected future logarithmic return.

The conditionally log-optimal portfolio is derived from the distribution of the return process {X_n}_n. In reality the true distribution of the returns, and hence the conditionally log-optimal strategy, is unknown to the investor. At this point financial planning needs statistics as a partner. Statistics serves the investor in solving the problem of finding a method that, using only historical return data and without knowledge of the true return distribution, generates an optimal causal portfolio strategy {b̂_n}_n. Optimality is understood here in the sense that, for every stationary and ergodic return process {X_n}_n, the strategy lets the investor's wealth Ŵ_n := ∏_{i=1}^n <b̂_i, X_i> grow on average as fast as under the log-optimal strategy {b*_n}_n. Formally, {b̂_n}_n should guarantee that with probability 1

lim sup_{n→∞} (1/n) log(W*_n / Ŵ_n) ≤ 0.

It is known that such methods exist (Algoet, 1992). They have the drawback, however, of being highly complex and of requiring vast amounts of historical data to produce practically useful results. One aim of this thesis is to develop simplified but efficient portfolio selection algorithms based on nonparametric prediction and estimation techniques. The algorithms are designed to be applicable to the largest possible classes of markets.

Chapter 2
Portfolio benchmarking: rates and dimensionality
The quality of a portfolio selection method is usually judged by comparison with a reference strategy. Our reference strategy is log-optimal portfolio selection, which, as we saw in Chapter 1, constitutes an optimal rule of behaviour. The investor will not succeed in outperforming it. Naturally, he will hope that the performance gap of his own strategy vanishes in the course of the investment process, namely as his estimates of the distribution of the return process improve with the growing amount of available historical data. If the investor chooses his portfolio at time n on the basis of the observations X_1, ..., X_n, he will earn a return of R̂_n = <b̂_{n+1}, X_{n+1}> in the next market period, whereas the log-optimal strategy yields R*_n = <b*_{n+1}, X_{n+1}>. Comparing the two values allows one to assess by how much b̂_{n+1} falls short of the log-optimal strategy b*_{n+1}.

From the standpoint of a logarithmic utility function it is therefore appropriate to measure the inferiority of the strategy b̂_{n+1} by the size of the (nonnegative) expected difference of the log-returns, i.e. by E log(R*_n / R̂_n). The smaller this value, the better the strategy b̂_{n+1}.

To judge the quality of the strategy b̂_{n+1} one therefore has to analyse, in particular, the speed at which E log(R*_n / R̂_n) tends to zero. Here the returns are assumed to form a process of independent, identically distributed random variables. Using concepts from information theory, Section 2.1 derives a lower bound on this speed of convergence. It states that even in the simplest of all markets, a market with only finitely many possible return constellations:

There is no portfolio selection rule that compensates its inferiority relative to the log-optimal strategy faster than 1/n in every market; i.e., there is always a market for which E log(R*_n / R̂_n) ≥ const. · (1/n) (Theorem 2.1.1).

There are, however, portfolio selection rules that attain this rate. In particular, the empirically log-optimal portfolio

b̂_{n+1} := arg max_{b∈S} (1/n) ∑_{i=1}^n log <b, X_i>  (0.0.3)

proves favourable here:

The empirically log-optimal portfolio (0.0.3) attains the lower bound on the rate of convergence of E log(R*_n / R̂_n) (Theorem 2.1.3).

Put somewhat casually, one could say that the empirically log-optimal portfolio makes up for its deficits at an optimal speed. These results hold largely independently of the number of stocks in the market under consideration. This is untypical of nonparametric estimation procedures and therefore deserves closer discussion (Theorem 2.1.4 shows that this phenomenon also occurs in more complicated markets).
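The maximisation in (0.0.3) over the simplex S can be carried out numerically. The following sketch uses exponentiated-gradient ascent on the empirical log-return; the choice of solver, step size and iteration count are illustrative assumptions of ours, not prescriptions of the thesis.

```python
import math

def empirical_log_optimal(returns, steps=2000, eta=0.05):
    """Approximate arg max over the simplex S of
    (1/n) * sum_i log <b, X_i>  (the empirically log-optimal
    portfolio (0.0.3)) by exponentiated-gradient ascent."""
    m = len(returns[0])
    b = [1.0 / m] * m                      # start from the uniform portfolio
    for _ in range(steps):
        # gradient of the empirical log-return: average of X_j / <b, X>
        grad = [0.0] * m
        for x in returns:
            dot = sum(bj * xj for bj, xj in zip(b, x))
            for j in range(m):
                grad[j] += x[j] / dot
        grad = [g / len(returns) for g in grad]
        # multiplicative update keeps b nonnegative; renormalising
        # keeps it on the simplex
        b = [bj * math.exp(eta * gj) for bj, gj in zip(b, grad)]
        s = sum(b)
        b = [bj / s for bj in b]
    return b
```

For two assets, a riskless one with constant return 1 and a stock returning 3 or 0.3 with equal empirical frequency, the maximiser of the empirical log-return puts the fraction 13/28 of wealth into the stock, which the iteration recovers.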
For this reason, Section 2.2 adds a more detailed discussion of the effects of the dimension of the market on portfolio selection. Limited computational capacity will force the investor, when planning his investments, to restrict himself to a smaller subset of all stocks in the market. This subset must be chosen in the planning phase, i.e. before the actual investment process. Criteria for this preselection are needed. The usual approach would be to pick individual stocks whose charts promise high growth potential. It will be shown that this route suffers from substantial shortcomings:

Every selection procedure that judges the individual stocks separately, e.g. by their expected logarithmic returns, is certain to make the wrong choice in a realistic market (Theorem 2.2.1).

This negative result shows that portfolio selection procedures require information beyond the individual expected log-returns. In markets with log-normally distributed returns, the variance-covariance structure of the returns will convey sufficient information. In Section 2.3 the results are illustrated by simulations and real-data examples, demonstrating their practical relevance.

Chapter 3
Predicted stock returns and portfolio selection
Armed with the insight that successful portfolio selection requires information on the variance-covariance structure of the stocks in the market (both temporal correlations and correlations between the individual stocks play a role), Section 3.1 presents an investment strategy that enjoys great popularity among investors.
The strategy proceeds in two steps, using the historical log-returns Y_n, Y_{n−1}, ..., Y_0 (Y_i := log X_i) as input for the investment decision at time n:

1. Produce an estimate of the future of the market. It will be shown that market forecasts should be based on conditional expectations of future log-returns given the past, i.e. on

Ŷ_{n+1} := E[Y_{n+1} | Y_n, Y_{n−1}, ...].

2. Invest exclusively in those stocks whose predictions Ŷ_{n+1} promise a better return than a fixed-interest security with return r. A stock is thus invested in if and only if

exp(Ŷ_{n+1}) ≥ r.

We call this the strategy of the "greedy investor", as it is designed to pick out only the best possible investment opportunities. The simplicity of the strategy is appealing, and in markets with a small variance of the log-returns it leads to sensible results (Section 3.1).
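The two-step selection rule can be sketched in a few lines. How wealth is split among the selected stocks is left open here; the equal weighting below is purely an illustrative assumption of ours (the thesis develops the strategy in Section 3.1).

```python
import math

def greedy_portfolio(predicted_log_returns, r):
    """Greedy-investor selection: invest only in assets whose
    predicted gross return exp(Y_hat) is at least the riskless
    return r.  Equal weighting among the selected assets is an
    illustrative assumption, not part of the selection rule."""
    selected = [j for j, y in enumerate(predicted_log_returns)
                if math.exp(y) >= r]
    m = len(predicted_log_returns)
    b = [0.0] * (m + 1)            # last entry: the riskless asset
    if selected:
        for j in selected:
            b[j] = 1.0 / len(selected)
    else:
        b[m] = 1.0                 # no stock beats the bond: hold the bond
    return b
```

With predictions (0.02, −0.05, 0.01) and r = 1.005, only assets 1 and 3 satisfy exp(Ŷ) ≥ r and receive weight.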
When implementing the strategy the investor faces the difficulty that the predictions Ŷ_{n+1} can only be computed with knowledge of the true distribution of the process. One must therefore settle for computing an estimate Ê(Y_n, ..., Y_0) of the conditional expectation E[Y_{n+1} | Y_n, Y_{n−1}, ...] from the market observations Y_n, ..., Y_0. It is known from the literature that no estimate obtained in this way can be strongly consistent in the sense that

lim_{n→∞} ( Ê(Y_n, ..., Y_0) − E[Y_{n+1} | Y_n, Y_{n−1}, ...] ) = 0  (0.0.4)

with probability 1 for every stationary and ergodic process {Y_n}_n (Bailey, 1976). On the one hand this result is discouraging; on the other hand it does not rule out the existence of strongly consistent prediction schemes for log-return processes as they occur in real financial markets. Gaussian log-return processes, which provide a good approximation of real log-return processes, come to mind in particular. Until now, however, the question of whether strongly consistent prediction algorithms exist for stationary and ergodic Gaussian processes had remained open. Section 3.2 is able to give a positive answer. Under weak additional assumptions on the Wold coefficients of the process,

a prediction algorithm Ê(Y_n, ..., Y_0) for stationary and ergodic Gaussian processes is developed that is strongly consistent in the sense of (0.0.4) and remarkably easy to implement (Corollary 3.2.3).

These results provide the subroutines needed to implement the strategy of the "greedy" investor in Gaussian log-return processes. The algorithm is, however, also of interest independently of the application given here, since prediction problems for Gaussian processes arise in a multitude of fields.
The convergence properties are proved in Section 3.3. Application examples with real and simulated data follow in Section 3.4, showing promising results when the algorithm is used for pure prediction as well as a subroutine of the "greedy" strategy.
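The idea of predicting a stationary Gaussian series from its past can be illustrated by a toy finite-order linear predictor. This is not the algorithm of Section 3.2 (which works under assumptions on the Wold coefficients of the process); it is merely a minimal AR(1) sketch of estimating a one-step conditional expectation from data.

```python
def ar1_predict(history):
    """One-step predictor for a zero-mean stationary series via a
    least-squares AR(1) coefficient:  Y_hat_{n+1} = a_hat * Y_n,
    a_hat = sum Y_i Y_{i-1} / sum Y_{i-1}^2.
    A toy illustration only, not the estimator of Section 3.2."""
    num = sum(y1 * y0 for y0, y1 in zip(history, history[1:]))
    den = sum(y0 * y0 for y0 in history[:-1])
    a_hat = num / den if den > 0.0 else 0.0
    return a_hat * history[-1]
```

On the exactly geometric sample 1, 0.5, 0.25, 0.125 the fitted coefficient is 0.5, so the prediction is 0.0625.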

Chapter 4
A Markov model with transaction costs: probabilistic view
In the simplest markets, in which the returns form independent, identically distributed random variables, one should invest in a log-optimal portfolio that is constant in time. Using a portfolio that is constant in time, one allocates a fixed proportion of the current total wealth to each stock. The proportion thus remains the same, but due to the changes of the stock prices relative to one another, the actual number of stocks held changes from market period to market period. Carrying out a log-optimal strategy therefore requires a large number of transactions. In reality this is a drawback not to be underestimated. Whatever wealth accrues, a large part of the gains will flow off again to settle transaction costs such as brokerage commissions, administration and communication costs. Consequently, the investor must adapt his strategy to these circumstances: he must make as few cost-intensive transactions as possible, yet enough to achieve good growth in value. Chapters 4 and 5 are devoted to the question of how these two requirements can be reconciled in one strategy.
To this end we assume that the returns evolve according to a Markov process of order d. In Chapter 4 we work under the premise that the distribution of the return process is known, an unrealistic assumption that we shall drop in Chapter 5. First, Section 4.1 extends the market model of Chapter 1 by transaction costs proportional to the volume of stocks purchased. It is not surprising that in such a situation the investor can afford only a restricted set of portfolio compositions without going bankrupt. It will become clear

that in Markovian return processes of order d it suffices to consider strategies based on portfolio selection functions, i.e. strategies of the form b_i = φ(b_{i−1}, X_{i−d}, ..., X_{i−1}) with a suitable function φ (Definition 4.1.2).

The next portfolio to be chosen is thus a function of the portfolio chosen last and the last d return vectors observed in the market. As before, the investor strives to maximise his expected logarithmic growth of wealth, here by choosing an optimal portfolio selection function φ.
Section 4.2 sets out how an optimal selection function φ can be constructed, all under the premise that the true distribution of the returns were known. The main result will show
that an optimal portfolio selection function φ can be constructed from a solution of the Bellman equation (Theorem 4.2.1, Equation 4.2.2).

The Bellman equation is well known from the theory of dynamic programming; nevertheless, fundamental differences between classical dynamic programming and the portfolio selection problem will emerge. In preparation for Chapter 5, Section 4.3 finally derives further analytic properties of the solution of the Bellman equation.
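A strategy driven by a portfolio selection function (Definition 4.1.2) can be sketched generically: the function φ itself, e.g. one constructed from a solution of the Bellman equation, is supplied by the user; the loop below only applies the recursion b_i = φ(b_{i−1}, X_{i−d}, ..., X_{i−1}).

```python
def run_selection_function(phi, b0, returns, d):
    """Apply a portfolio selection function phi along an observed
    return sequence:  b_i = phi(b_{i-1}, X_{i-d}, ..., X_{i-1}).
    `phi` is user-supplied (illustrative interface assumption):
    it maps (previous portfolio, window of d return vectors)
    to the next portfolio."""
    bs = [b0]
    for i in range(d, len(returns)):
        window = returns[i - d:i]      # the last d observed return vectors
        bs.append(phi(bs[-1], window))
    return bs
```

With d = 1 and three observed return vectors, the loop produces the initial portfolio plus two rebalancing decisions.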

Chapter 5
A Markov model with transaction costs: statistical aspects
The Bellman equation as set up in Chapter 4 depends crucially on the distribution of the return process {X_n}_n. This dependence takes the form of a conditional expectation that has to be evaluated. The results of Chapter 4 are therefore valid only under the premise that the investor knows the true distribution of the return process, which in practice is of course illusory. At best, the investor has an estimate of the distribution of the returns at his disposal. This enables him to compute an estimate of the conditional expectation in question, which then provides him with an approximation of the Bellman equation. Using techniques from nonparametric regression estimation,

Section 5.1 shows that the Bellman equation of Chapter 4 has a natural empirical counterpart based on market observations (Equation 5.1.2).

Arguments similar to those of Chapter 4 will allow us to solve this empirical Bellman equation under realistic conditions.

This will lead to a strategy that is based exclusively on historical returns yet has the same optimality properties as the (theoretically) optimal portfolio selection strategy under transaction costs (Theorems 5.1.1 and 5.1.2).
The considerations of Chapter 5 draw on generalisations of known results on the uniform convergence of regression estimates. These generalisations are derived in Section 5.2. If, e.g., {X_n}_n is a stationary process with the geometric mixing property, and g is chosen from a class G of Lipschitz continuous functions, we estimate the conditional expectation

R(g, b, x) := E[g(X_1, b) | X_0 = x]  (b ∈ S)

by a kernel estimate R_n(g, b, x). Depending on the smoothness of a density of X_0 (we assume that one exists), the speed of convergence in the limit relation

sup_{g∈G} E sup_{x∈X, b∈S} |R_n(g, b, x) − R(g, b, x)| → 0  (n → ∞)

is determined. Here the expected maximal estimation error is considered uniformly over the class G (Corollary 5.2.2). The result obtained is of interest not only with regard to our application, but also beyond it, as an independent result in nonparametric regression estimation.
Section 5.3, finally, is devoted to the proof of the optimality properties of the algorithm, combining the results of Chapter 4 with those on uniformly consistent regression estimation.

Chapter 6
Portfolio selection functions in stationary return processes
In real financial markets the return process {X_n}_n may deviate from a Markov process of order d. In this chapter, therefore, even more general market models are considered. Transaction costs are ignored, but return processes are studied for which essentially only stationarity and ergodicity are assumed. It is natural for the investor to base his investment decisions on the last d return vectors observed in the market (d fixed). This leads to the concept of log-optimal portfolio selection functions.
This concept is introduced in Section 6.1. The investor again uses a logarithmic utility function and therefore tries to determine a log-optimal portfolio selection function, i.e. a measurable function

b*: IR^{dm}_+ −→ S,

such that (<·,·> denotes the Euclidean scalar product)

E(log <b*(X_0, ..., X_{d−1}), X_d>) ≥ E(log <f(X_0, ..., X_{d−1}), X_d>)

for all measurable functions f: IR^{dm}_+ −→ S. For the (n+1)st market period, b* suggests that the investor acquire the portfolio b*(X_{n−d+1}, ..., X_n).

In its generality, the concept of log-optimal portfolio selection functions falls short of the concept of the conditionally log-optimal portfolio (the latter chooses the parameter d such that the entire past of the process enters the portfolio selection). Although it is a simplification in this sense, log-optimal portfolio selection functions combine several advantages over the log-optimal portfolio, in particular as regards computation, estimation and interpretation.
With log-optimal portfolio selection functions the investor faces the same problem as with log-optimal portfolios. Both can only be computed if the true distribution of the return process were known. In practice this is not the case, and one again needs an estimation procedure that approximates a log-optimal portfolio selection function from return data observed in the past.

In Section 6.2 an algorithm is therefore developed that computes estimates b̂_n of a log-optimal portfolio selection function b* from historical return data.

Beyond stationarity and ergodicity, only very mild additional assumptions are made. Concretely, it is assumed that the return process {X_n}_{n=0}^∞ is an [a, b]^m-valued stationary and ergodic stochastic process (0 < a ≤ b < ∞ need not be known) and that a Lipschitz condition holds for the conditional return quotient E[X_d / <s, X_d> | X_{d−1} = x_{d−1}, ..., X_0 = x_0]. The Lipschitz constant L is assumed to be a known market constant.
By means of a stochastic gradient procedure and methods of nonparametric regression estimation it is shown

that the estimates b̂_n converge with probability 1 to the true log-optimal portfolio selection function b*, avoiding the mixing conditions typically assumed in the literature (Theorem 6.2.2).

In practical applications the following result plays an even more important role:

Portfolio selection by means of the estimated log-optimal portfolio selection functions yields optimal growth of wealth among all strategies that base their investment decisions on the last d returns observed in the market.

Let Ŝ_n be the wealth achieved after n market periods when on the (i+1)st day of trading the current estimate b̂_i is used to choose the portfolio b̂_i(X_{i−d+1}, ..., X_i). If S_n denotes the wealth earned in the same time by an arbitrary other selection strategy based on the respective last d observed returns, then

lim sup_{n→∞} (1/n) log(S_n / Ŝ_n) ≤ 0

with probability 1 (Corollary 6.2.3).
After a suitable modification, the algorithms and the results retain their validity even if, as in practice, the market constant L is unknown to the investor (Theorem 6.2.4). Section 6.3 proves the results, and the chapter is rounded off with several realistic examples in Section 6.4.
Chapters 2, 3 and 6 can be read independently of one another; they are self-contained. Chapters 4 and 5, however, are closely interlinked. Notation going beyond standard mathematical notation is explained at its first occurrence. The reader is also referred to the list of abbreviations at the beginning of this thesis. The computations and figures for the examples were produced with Matlab 4.0 and 6.0.0.88, Minitab 11.2 and R 1.1.1, using historical quotes (daily closing prices) of the New York Stock Exchange from www.wallstreetcity.com.

ACKNOWLEDGEMENTS

I am indebted to

Prof. Harro Walk, who suggested that I investigate this subject. He advised me on many points, always found the time to discuss the results, and more than once I benefitted from his extensive knowledge.

Prof. Laszlo Györfi, for his hospitality during my visits to the Technical Uni-
versity of Budapest. On several occasions he gave me the right impulse
and really useful advice.

Prof. Volker Claus, for his interest in my work and for discussing the contents
of this thesis with me.

The DFG and the College of Graduates "Parallel and Distributed Systems", for funding my research with everything that involves.

Dr. Michael Kohler, who introduced me to nonparametric curve estimation.


His expertise in this field was an invaluable source.

Dr. Jürgen Dippon, who never threw me out when I felt like discussing prob-
lems in mathematical statistics and finance.

Prof. Adam Krzyżak, for being my host during a stay at Concordia University, Montréal, and for many discussions about mathematical and other interesting subjects.

CHAPTER 1

Introduction: investment and nonparametric statistics
Investment is the strategic allocation of resources, typically of monetary re-
sources, in an environment, typically a market of assets, whose future evolution
is uncertain. This definition leaves much room for subjective interpretation. In
particular, the following points have to be made more precise:

– What market is under consideration? This involves specifying and stan-


dardizing the assets traded in the market (e.g. stocks, bonds, options,
futures, currencies, gold, oil, ...) as well as setting up a reference system
for pricing the assets (e.g. closing or opening prices at the New York
Stock Exchange, world market price for raw materials, ...).

– What actions and instruments may be applied by the investor? Possible


actions may be restricted by exogenous terms and regulations of trade
(e.g. transaction costs, brokerage fees, trading limitations) or personal
preferences (e.g. to rule out borrowing money or short positions in stocks).

– What investment goal is pursued by the investor? Traditionally, the goal


is the maximisation of a personal utility function of the returns on the
allocated resources. The market being chancy, individual risk aversion
preferences may enter the form of the utility function, or restrictions are
imposed on the set of possible investment actions.

Thus, “investment” becomes a highly subjective term, including investment as


it is understood in this thesis. In the following we set up the specific invest-
ment scenario as we shall consider it in this thesis. We believe this scenario
is broadly accepted as the typical setting for investment analysis, although we

do not deny that particular investment situations require further adaptation


and modification. It should also be pointed out that, as future asset prices are
subject to random fluctuations, “investment” is a good deal about decision tak-
ing under uncertainty, which makes mathematical statistics the natural partner
of investment (an observation that may be attributed to the groundbreaking
work of Bachelier, 1900, who used statistics to compare his theoretical model
with real market data). An economist will find the economic side of this thesis
to be lacking. There are excellent books on investment science from a more
economic point of view (e.g. Francis, 1980; Luenberger, 1998), but most of
them are lacking in statistical depth. This thesis is about investment from a
decisively statistical point of view – we can therefore only superficially touch
upon economic issues.

1.1 The market model


We consider a market in which m assets (which we will think of as stocks and
bonds) are traded. Taking a macroeconomic point of view, the prices of the
assets (stock quotes, bond values) are generated under the authority of the
market as a whole, i.e. by the large ensemble of investors. We assume that for
the individual investor there is no way to influence the prices by launching spe-
cific investment actions or distributing insider or side information of whatever
kind. In this situation, let P1,n , ..., Pm,n > 0 be the prices of the assets 1, ..., m
at the beginning of market period n (market period n lasts from “time” n − 1
to n, time being measured, e.g., in days of trading). To the “powerless” individ-
ual investor described above, the asset prices present themselves as a random
process on a common probability space (Ω, A, P).
An investment of 1 unit of money in asset i at time n − 1 yields a return

X_{i,n} := P_{i,n} / P_{i,n−1}
during the subsequent market period. We collect the returns of the single assets
in a return vector
Xn := (X1,n , ..., Xm,n)T .
We will often work with the log-returns

log Xn := (log X1,n , ..., log Xm,n )T .


The return process {X_n}_{n=1}^∞ and the log-return process {log X_n}_{n=1}^∞ are stochastic processes on (Ω, A, P).
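The return and log-return definitions above translate directly into code; the following is a minimal sketch computing X_{i,n} = P_{i,n} / P_{i,n−1} and log X_{i,n} for a single asset's price series.

```python
import math

def returns_from_prices(prices):
    """Per-period gross returns X_n = P_n / P_{n-1} from a positive
    price series, together with the corresponding log-returns."""
    returns = [p1 / p0 for p0, p1 in zip(prices, prices[1:])]
    log_returns = [math.log(x) for x in returns]
    return returns, log_returns
```

For prices 100, 110, 99 this gives the gross returns 1.1 and 0.9; the first log-return is positive, the second negative.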
In most of our investigations we will assume that the return process {Xn }n is
stationary and ergodic in the sense of the following definition (Stout, 1974, Sec.
3.5; Shiryayev, 1984, V §3):

Definition 1.1.1. Let {X_n}_{n=1}^∞ be an IR^m-valued stochastic process on a probability space (Ω, A, P).

1. {X_n}_{n=1}^∞ is called stationary, if

P_{(X_i, ..., X_j)} = P_{(X_{i+t}, ..., X_{j+t})}

for all integers i, j, t with i ≤ j.

2. A ∈ A is called an invariant event of {X_n}_{n=1}^∞, if there exists a B ∈ B((IR^m)^∞) such that

A = {X_i, X_{i+1}, ...}^{−1}(B)

for all i ∈ IN.

3. A stationary process {X_n}_{n=1}^∞ is called ergodic, if the probability of any invariant event of {X_i}_{i=1}^∞ is either 0 or 1.

Stationarity preserves the stochastic regime over time, ergodicity is the setting
where time averages along trajectories of the process converge almost surely to
expected values under the process distribution:

Theorem 1.1.2. (Birkhoff Ergodic Theorem, Stout, 1974, Sec. 3.5) Let {X_n}_{n=1}^∞ be an IR^m-valued stationary and ergodic stochastic process on a probability space (Ω, A, P) with E|X_1| < ∞. Then

(1/n) ∑_{i=1}^n X_i −→ EX_1

P-almost surely (P-a.s.), i.e. for all ω ∈ Ω from a set of probability 1.

Stationarity and ergodicity are the basic assumptions for most statistical inves-
tigations. The stationarity of stock returns is a thoroughly investigated field,

both by economists (e.g. Francis, 1980, A24-1) and statisticians (e.g. Franke et
al., 2001, Sec. 10.6). It is natural to assume that there is short term stationar-
ity in most stock returns, some authors (Francis, 1980) even claim that return
data may be treated as stationary if the time horizon comprises at least one
complete business cycle. There is no conclusive answer that proves or disproves
stationarity for the majority of stock markets, and it seems as though this has
to be decided from case to case. We accept stationarity as a working hypothesis,
accounting for the fact that it is common practice to assess and compare the
performance of statistical methods in the stationary setting.
Not much is known about the ergodic properties of stock quotes or stock re-
turns, neither from the theoretical economist’s point of view, nor from empirical
studies. There are indications that the ergodic properties of a market depend
very much upon the flow of information in the market and on the microeconomic
price generation (Donowitz and El-Gamal, 1997). These are difficult to assess,
and so the typical approach has become to derive algorithms under ergodic
hypotheses and then let the success of the algorithm justify the hypotheses.
Throughout this thesis we consider nonparametric models for {Xn }n, i.e.
models that do not require a parametrized evolution equation (in contrast to
MA, AR, ARMA, ARIMA, ARCH and GARCH models, cf. Brockwell and
Davis, 1991, Franke et al., 2001). The nonparametric approach guarantees
highest flexibility in modelling, skipping model parameters which otherwise
require extensive diagnostic model testing. To be more precise, the following
models will be investigated in this thesis:

1. {Xn }n is a sequence of independent identically distributed (i.i.d.) random


variables (e.g. with finitely many outcomes) – Chapter 2.1.

2. The conditional distribution of X_{n+1} given X_n, ..., X_1 (which we will denote by P_{X_{n+1}|X_n,...,X_1}) is log-normal (i.e. P_{log X_{n+1}|X_n,...,X_1} is a normal distribution) – Chapter 2.2.

3. {log X_n}_n is a stationary Gaussian time series (i.e. (log X_{n+k}^T, ..., log X_n^T) follows a multivariate normal distribution which depends upon k but not upon n) – Chapter 3.

4. {X_n}_n is a Markov process of order d (i.e., we assume P_{X_{n+1}|X_n,...,X_1} = P_{X_{n+1}|X_n,...,X_{n−d+1}}) – Chapters 4 and 5.

5. {Xn }n is a stationary and ergodic time series – Chapter 6.

Each of these models has been found useful for describing asset return data in
real financial markets. Model 1 is the Cox-Ross-Rubinstein model (Cox et al.,
1979; Francis, 1980, A24-1 and A24-2; Luenberger, 1998, Ch. 11; Franke et al.,
2001, Ch. 7). Models 2 and 3 are models with log-normal returns (Francis, 1980,
A24-1; Luenberger, 1998, Ch. 11) which arise, e.g., from a discretisation of the
Black-Scholes model (Luenberger, 1998, Ch. 11; Korn and Korn, 1999, Kap.
II). In contrast to the classical Black-Scholes model we allow for autocorrelated
log-returns in Chapter 3 (i.e. Cov(log Xn , log Xn+k ) 6= 0 for some k > 0). In
practice, autocorrelation of the log-returns manifests itself for small time lags
k (Franke et al., 2001, Ch. 10) as well as large k (long range dependence, Ding
et al., 1993; Peters, 1997). Many studies have indicated that the logarithms
of stock returns slightly depart from a Gaussian distribution (e.g. by heavy
tails, Mittnik and Rachev, 1993; McCulloch, 1996; Franke et al., 2001, Ch. 10
and the references there). It is therefore advisable to drop the assumption of
log-normality of the stock returns wherever possible. This is done in models 4
and 5, model 4 capturing the autocorrelation of stock returns by the Markov
property.
We will assume that the asset returns correspond to one of the models 1-5.
However, we do not assume the exact form of the true return distribution to be
known to the investor (with the exception of Chapter 4). Hence, the investor has
to apply statistical estimation and forecasting techniques for strategy planning.
Clearly, nonparametric models require nonparametric statistical methods and
arguments are usually more involved than in the parametric setting (for an
introduction to nonparametric estimation as we will use it see Györfi et al.
1989, 2002). Unfortunately, nonparametric methods are not yet common in
econometrics and financial mathematics (Pagan and Ullah, 1999, and Franke
et al., 2001, being two of the few notable exceptions). In this thesis we aim to
demonstrate what powerful impetus nonparametric statistical estimation may
give to investment strategy planning.

1.2 Portfolios and investment strategies


Having chosen a market model, we turn to the actions that may be taken by

the investor. Throughout the investment process, the investor holds varying
portfolios of the m assets. Taking a discrete time trading point of view, we
assume that the investor is only allowed to rebalance his portfolio at the be-
ginning but not in the course of each market period. The portfolio held at the
beginning of market period n (i.e. from time n − 1 to n) can be given by the
quantities q1,n−1 , ..., qm,n−1 of the single assets owned by the investor (qi,n−1 < 0
corresponds to borrowed assets, so-called short positions). The investor then
enters the nth market period with a portfolio value of
$$W_{n-1}^+ := \sum_{i=1}^m P_{i,n-1}\, q_{i,n-1}.$$

The remaining value at the end of the market period is


$$W_n^- := \sum_{i=1}^m P_{i,n}\, q_{i,n-1} = \sum_{i=1}^m X_{i,n} P_{i,n-1}\, q_{i,n-1}.$$

Hence, if $W_{n-1}^+ \ne 0$, the portfolio achieved a return of
$$\frac{W_n^-}{W_{n-1}^+} = \sum_{i=1}^m X_{i,n}\, b_{i,n} \qquad (1.2.1)$$

with
$$b_{i,n} := \frac{P_{i,n-1}\, q_{i,n-1}}{\sum_{j=1}^m P_{j,n-1}\, q_{j,n-1}}.$$

Note that $\sum_{i=1}^m b_{i,n} = 1$, and we will find it more convenient to denote a portfolio
by the portfolio vector
$$b_n := (b_{1,n}, ..., b_{m,n})^T$$

rather than listing q1,n−1 , ..., qm,n−1 . If the investor is allowed to consume an
amount cn before changing his portfolio bn for bn+1 and entering market period
n + 1, then $W_n^+$ is given by
$$W_n^+ = W_n^- - c_n. \qquad (1.2.2)$$

(1.2.1) and (1.2.2) are the equations governing general discrete time investment.
Throughout this thesis we are concerned with an investor who neither consumes
nor deposits new money into his portfolio but reinvests his current portfolio
[Figure 1.1: Setting for the return and portfolio processes. Timeline: market period n ends at time n (the end of the nth day of trading); the return $X_n$ is realised during period n, the portfolio $b_n$ is held throughout period n, and $W_n$ denotes the wealth at time n.]

value in each market period. Hence, $c_n = 0$ for all n, and (1.2.1) and (1.2.2) boil
down to
$$W_n := W_n^+ = W_n^- = W_0 \prod_{i=1}^n \langle b_i, X_i \rangle, \qquad (1.2.3)$$
where $W_n$ is the current wealth of the investor at time n. Moreover, the factor
$\langle b_n, X_n \rangle := b_n^T X_n$ can be interpreted as the portfolio return during the nth
market period.
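The wealth dynamics (1.2.1)-(1.2.3) are easy to mechanise. The following sketch (not part of the thesis; the return vectors and portfolio weights are invented example data) computes the wealth path of a constantly rebalanced two-asset portfolio:

```python
# Illustrative sketch of the wealth dynamics (1.2.3): with no consumption the
# wealth evolves as W_n = W_0 * prod_{i=1}^n <b_i, X_i>.
# All numbers below are invented example data.

def portfolio_return(b, x):
    """The portfolio return <b, x> earned during one market period."""
    return sum(bi * xi for bi, xi in zip(b, x))

def wealth_path(w0, portfolios, returns):
    """Wealth W_0, W_1, ..., W_n for a self-financing investor (c_n = 0)."""
    path = [w0]
    for b, x in zip(portfolios, returns):
        path.append(path[-1] * portfolio_return(b, x))
    return path

# two assets over three market periods, rebalanced to 50/50 every period
returns = [(1.05, 0.98), (0.97, 1.02), (1.10, 1.00)]
portfolios = [(0.5, 0.5)] * len(returns)
path = wealth_path(1.0, portfolios, returns)
# path[-1] is approximately 1.015 * 0.995 * 1.05
```

Each step multiplies the current wealth by the period's portfolio return, exactly as in (1.2.3).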
Moreover, we assume that the investor never enters short positions, i.e. $b_{i,n} \ge 0$.
Then $b_{i,n}$ is the proportion of the current wealth $W_{n-1}$ invested in asset i at time
n − 1. The portfolio vector $b_n$ chosen for market period n is a member of the simplex

$$S := \Big\{ (s_1, ..., s_m)^T \;\Big|\; \sum_{i=1}^m s_i = 1,\; s_i \ge 0 \Big\}.$$

The choice of bn depends on the information In which the investor can access at
time n and which he deems relevant. Thus bn = bn (In) where In typically com-
prises a number of past observed asset returns (a substring of X1 , ..., Xn−1), in
some cases additional side or insider information about the market and external
economic factors. For specific choices of In , this is the setting for Chapters 2, 3
and 6 (Figure 1.1).
In reality, the range of portfolio choices is further narrowed by the occurrence of
transaction costs. Each transaction in a real market (purchase, sale of assets)
generates costs (brokerage fees, commission, administrative and communication
expenses). The total amount of these fees is withdrawn from the investor’s
wealth. Thus, the range of portfolios the investor may choose is restricted to
those portfolios whose acquisition generates no more transaction costs than


the investor’s current wealth. Then, roughly speaking, the investor is caught
between making as few costly transactions as possible on the one hand and
making as many transactions as necessary to boost his wealth on the other
hand. No wonder that strategic planning under transaction costs requires much
deeper arguments and has received considerable attention in the literature for both
discrete and continuous time models (see e.g. Blum and Kalai, 1999; Bobryk
and Stettner, 1999; Cadenillas, 2000; Bielecki and Pliska, 2000). We shall return
to a typical case of transaction costs in more detail in Chapters 4 and 5.

1.3 Pleading for logarithmic utility


As can be seen from (1.2.3), invested money grows multiplicatively, as a product
of daily returns. Suppose the investor wants to maximize the expected value
of his terminal wealth. If the daily returns {Xi }i are stationary, maximization
of the single expected daily returns is not appropriate. It does not capture
autocorrelation in the returns, since, in general,
$$E \prod_{i=1}^n \langle b_i, X_i \rangle \ne \prod_{i=1}^n E\, \langle b_i, X_i \rangle.$$

The expectation is rather determined by the expectation of the logarithmic daily
returns, since by Taylor expansion (h.o.t. denoting terms of order 2 and higher)
$$E \prod_{i=1}^n \langle b_i, X_i \rangle = 1 + E \log \prod_{i=1}^n \langle b_i, X_i \rangle + \text{h.o.t.} = 1 + \sum_{i=1}^n E \log \langle b_i, X_i \rangle + \text{h.o.t.}$$
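The size of the neglected higher-order terms can be checked numerically; the daily-magnitude returns below are invented, not data from the thesis:

```python
# For portfolio returns close to 1, the product of period returns is well
# approximated by 1 plus the sum of log-returns; the gap collects the h.o.t.
from math import log, prod  # prod requires Python >= 3.8

returns = [1.004, 0.997, 1.012, 0.995]   # invented daily portfolio returns
exact = prod(returns)
approx = 1 + sum(log(r) for r in returns)
gap = abs(exact - approx)                # size of the neglected h.o.t.
```

For returns of a few tenths of a percent per day the gap is of order $10^{-5}$, which is what makes log-returns so convenient for high frequency data.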

It is widely accepted that for returns below 10% and high frequency data (e.g.
daily returns) the logarithmic approximation is convincing (Franke et al., 2001,
Sec. 10.1). This leads to the notion of log-optimal portfolios, i.e. portfolios
that maximize the expected logarithmic utility of the investor's growth
of wealth. The log-optimal portfolio of a process $\{X_n\}_n$ of independent and
identically distributed (i.i.d.) returns $X_n$ is defined as

$$b^* := \arg\max_{b \in S} E(\log \langle b, X_1 \rangle). \qquad (1.3.1)$$

Log-optimal portfolios have been suggested first by Kelly (1956), Latané (1959)
and Breiman (1961) as diversification strategy for investment in a speculative
market given by a process $\{X_n\}_{n=1}^\infty$ of i.i.d. return vectors. Since then, numerous
investigations, notably by Cover (e.g. Cover, 1980, 1984; Cover and
Thomas, 1991) and Algoet (e.g. Algoet and Cover, 1988) have explored the
theoretical aspects of this strategy, establishing that investment in log-optimal
portfolios yields optimal asymptotic growth rates for the invested wealth. An
introduction, various results and sources of reference can be found in Cover and
Thomas (1991, Chapter 15). There, for stationary and ergodic return processes
$\{X_n\}_n$, (1.3.1) is generalized by the conditional log-optimal portfolio (for
the nth investment step)
$$b_n^* := \arg\max_{b \in S} E[\log \langle b, X_n \rangle \mid X_{n-1}, ..., X_1]$$
(conditioning being void for n = 1). The
conditional log-optimal portfolio is the log-optimal portfolio under the condi-
tional distribution PXn |Xn−1 ,...,X1 and hence a random variable. The log-optimal
investment strategy b∗1 , b∗2, ... is a member of the class of non-anticipating
strategies, i.e. sequences of S-valued random variables b1 , b2, ... with the prop-
erty that each bn is measurable w.r.t. the σ-algebra generated by X1 , ..., Xn−1
(hence the strategy requires no more information than available at time n).
The technical aspects of conditional log-optimal portfolios (we will often drop
“conditional” for brevity) are well explored:
Existence and uniqueness of the log-optimal portfolio has been investigated in
Österreicher and Vajda (1993) and Vajda and Österreicher (1994), correcting a
wrong criterion used in Algoet and Cover (1988). The main result is

Theorem 1.3.1. (Vajda and Österreicher, 1994) Let $X = (X_1, ..., X_m)$ be
a stock market return vector with distribution $P_X$. Then there exists a log-optimal
portfolio $b^* \in S$ with $|E \log \langle b^*, X \rangle| < \infty$ if and only if
$$E \Big| \log \sum_{i=1}^m X_i \Big| < \infty.$$
$b^*$ is unique if $P_X$ is not confined to a hyperplane in $\mathbb{R}^m$ containing the diagonal
$\{(d, ..., d) \in \mathbb{R}^m \mid d \in \mathbb{R}\}$.

A good algorithm for the calculation of a log-optimal portfolio from the (known)
distribution PX of the return vector X was given by Cover (1984).

Theorem 1.3.2. (Cover, 1984) Assume the support of $P_X$ is of full dimension
in $[0, \infty)^m$ and choose some $b^0 \in S$ with non-zero entries. Then the recursively
generated portfolio vectors $b^k = (b_1^k, ..., b_m^k)$ with
$$b_i^{k+1} = b_i^k \cdot E\, \frac{X_i}{\langle b^k, X \rangle}$$
converge to the log-optimal portfolio $b^*$ as $k \to \infty$.
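For a distribution with finite support the expectation in Theorem 1.3.2 is a finite sum, so the recursion can be sketched in a few lines of Python (the two-point return distribution below is an invented example, not one from the thesis):

```python
# Sketch of the recursion b_i <- b_i * E[X_i / <b, X>] of Theorem 1.3.2 for a
# finitely supported return distribution (invented example data).

def cover_iteration(support, probs, b0, steps=500):
    """Iterate Cover's update; converges to the log-optimal portfolio."""
    b = list(b0)
    m = len(b)
    for _ in range(steps):
        expect = [0.0] * m
        for x, p in zip(support, probs):
            r = sum(bi * xi for bi, xi in zip(b, x))   # <b, X>
            for i in range(m):
                expect[i] += p * x[i] / r              # E[X_i / <b, X>]
        b = [bi * ei for bi, ei in zip(b, expect)]     # stays in the simplex
    return b

# one volatile stock against cash (asset 2 always returns 1)
support = [(2.0, 1.0), (0.4, 1.0)]
probs = [0.5, 0.5]
b_star = cover_iteration(support, probs, [0.5, 0.5])
# the maximiser of 0.5*log(1+b) + 0.5*log(1-0.6b) is b = 1/3
```

Note that the update leaves the simplex invariant, since the components of the new vector sum to $E[\langle b, X\rangle / \langle b, X\rangle] = 1$; at a fixed point with all components positive one has $E[X_i/\langle b, X\rangle] = 1$, in accordance with the Kuhn-Tucker conditions of Theorem 1.3.3.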

This is closely linked with the following, the Kuhn-Tucker conditions for a log-
optimal portfolio.

Theorem 1.3.3. (Cover and Thomas, 1991, Theorem 15.2.1) A portfolio
vector $b^* = (b_1^*, ..., b_m^*) \in S$ is a log-optimal portfolio for the return vector
$X = (X_1, ..., X_m)$ if and only if it satisfies the conditions
$$E\, \frac{X_i}{\langle b^*, X \rangle} \begin{cases} = 1 & \text{if } b_i^* > 0, \\ \le 1 & \text{if } b_i^* = 0. \end{cases}$$

The superiority of investment according to conditionally log-optimal portfolios


rests upon the following theorem (Algoet and Cover, 1988; Cover and Thomas,
1991, Theorem 15.5.2).

Theorem 1.3.4. (Algoet and Cover, 1988) Assume the return vectors $\{X_i\}_{i=1}^\infty$
form a stationary and ergodic process. Let $S_n^* := \prod_{i=1}^n \langle b_i^*, X_i \rangle$ be the wealth
at time n resulting from a series of conditionally log-optimal investments, $S_n$
the wealth from any other non-anticipating portfolio strategy (both starting
with 1 unit of money). Then
$$E\, \frac{S_n}{S_n^*} \le 1 \ \text{for all } n \quad \text{and} \quad \limsup_{n \to \infty} \frac{1}{n} \log \frac{S_n}{S_n^*} \le 0 \ \text{with probability 1.} \qquad (1.3.2)$$

The second part of (1.3.2) can be interpreted in various ways:

– It proves that, eventually for large n, $S_n < \exp(n\varepsilon) S_n^*$ whatever $\varepsilon > 0$,
which means that no non-anticipating strategy can infinitely often exceed
the log-optimal strategy by an amount that grows exponentially fast (i.e.
an amount that couldn't be compensated for by investment in a fixed
interest rate bank account).

– It proves that the log-optimal portfolio will do at least as well as any other
non-anticipating strategy to first order in the exponent of capital growth,
i.e. it guarantees Sn∗ = exp (nW + o(n)) with highest possible rate W .

From the first part of (1.3.2) Bell and Cover (1988) conclude that

– there is no essential conflict between good short-term and long-run per-


formance. Both are achieved by maximizing the conditional expected
log-return.

The log-optimality criterion has not been undisputed, however. In his criticism,
Samuelson (1971, also discussed in Markowitz, 1976) considers a market with
i.i.d. returns X1 , X2 , ... and compares the expected wealth ESn∗ from a series of
log-optimal investments with the expected wealth ESn∗∗ from investment in the
fixed portfolio
$$b^{**} := \arg\max_{b \in S} E\, \langle b, X_1 \rangle$$
(maximization of expected return). Using the independence and the identical
distribution of the returns he finds that
$$\frac{ES_n^{**}}{ES_n^*} = \frac{E \prod_{i=1}^n \langle b^{**}, X_i \rangle}{E \prod_{i=1}^n \langle b_i^*, X_i \rangle} = \left( \frac{\max_{b \in S} E\, \langle b, X_1 \rangle}{E\, \langle b_1^*, X_1 \rangle} \right)^n \to \infty \quad (n \to \infty).$$

Hence there are strategies that outperform log-optimal strategies in terms of


expected terminal wealth for long run investment. However, when comparing
the investor's strategy with a competing strategy, we think that the ratio of
wealths considered in (1.3.2) is more instructive than two separate expectations
for the investor's strategy and the competing strategy. This is a typical example
of criticism offered by classical economists who favour the Markowitz mean-
variance approach to portfolio optimization (Markowitz, 1959; Luenberger,
1998, Ch.6.4ff.). There, the investor seeks to maximize the portfolio perfor-
mance E < b, X > under the constraint of not exceeding a certain threshold
for the risk Var < b, X > (or for the value-at-risk, i.e., quantiles of the return
distribution, in more modern versions of the mean-variance approach).

We would like to emphasize that it is not a question of taste whether or not


to use the log-optimal approach. We strongly plead for investment under loga-
rithmic utility because of the following facts:

– In their spirited defence of the log-optimal criterion Algoet and Cover


(1988) come to the conclusion that the mean-variance approach lacks
generality (e.g. for non-log-normally distributed returns, see Samuelson,
1967, and for multiperiod investment, see Luenberger, 1998, Ch. 8.8).

– It is doubtful whether investment analysis should be founded on expectations
(typically $S_n$ deviates much more from $ES_n$ than $\log S_n$ does from
$E \log S_n$, a stabilizing effect of the log-transform). Pathwise results such as
the second part of (1.3.2) are more instructive than results on averages.

Realistically, the true distribution of market returns and hence the log-optimal
strategy is not revealed to the investor. Then the key problem is (as Algoet,
1992, put it):
Find a non-anticipating portfolio selection scheme $\{\hat{b}_n\}_n$ (a so-called universal
portfolio selection scheme) such that for any stationary ergodic market process
$\{X_n\}_n$, the compounded capital $\hat{S}_n := \prod_{i=1}^n \langle \hat{b}_i, X_i \rangle$ will grow exponentially
fast almost surely (i.e. with probability 1) with the same maximum rate as under
the log-optimum strategy $\{b_n^*\}_n$, that is, $\lim_{n \to \infty} \log \hat{S}_n / n = \lim_{n \to \infty} \log S_n^* / n$
almost surely.
To obtain a universal portfolio selection scheme, under weak conditions on the
market one may choose the log-optimal portfolio with respect to some appro-
priately consistent estimate of PXn |X1 ,...,Xn−1 in the nth investment step (more
precisely, distribution estimates that almost surely exhibit weak convergence to
the true distribution). This was demonstrated by Algoet (1992, Theorem 7). He
also provides an appropriate, yet complicated estimation scheme (Algoet, 1992,
Theorem 9). Instead, we can also use the more transparent scheme of Morvai et
al. (1996). Algoet points out that there are universal portfolio selection schemes
that do not require an explicit distribution estimation scheme as a subroutine
(Algoet, 1992, Sec. 4.3). But still, all existing algorithms seem to require an
enormous amount of past data, making their feasibility in practical situations
doubtful (as noted, e.g., in Yakowitz, Györfi et al., 1999). More practicable
results have been obtained in the case of independent, identically distributed
return vectors. For instance, Morvai (1991, 1992) and Österreicher and Vajda
[Figure 1.2: Our approach to investment strategy planning. Market model: assets i = 1, ..., m with a stationary, ergodic stochastic return process $\{X_n\}_n$. Investment actions: portfolio process $\{b_n\}_n$ in S, transaction costs (occasionally), no consumption, no short positions. Investment goal: log-utility, i.e. maximisation of expected log-returns, good for both short term and long run investment.]

(1993) propose portfolio strategies which are based on selecting the log-optimal
portfolio with respect to the empirical distribution of the data (the so-called
empirical log-optimal portfolio, more on that in Chapter 2). Those esti-
mators can be computed with reasonable effort. Repeated investment following
their strategies asymptotically yields the optimal growth rate of wealth with
probability one. However, in merely stationary and ergodic return processes
they produce suboptimal results.
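As a concrete illustration (a sketch under invented return data, not the authors' implementations), the empirical log-optimal portfolio can be computed by running the fixed-point recursion of Theorem 1.3.2 on the empirical measure of the observed returns:

```python
# Empirical log-optimal portfolio: the log-optimal portfolio with respect to
# the empirical distribution of the past return vectors, computed here with
# the recursion of Theorem 1.3.2.  The return data below are invented.

def empirical_log_optimal(data, steps=2000):
    """Approximate arg max over the simplex of (1/n) * sum_j log <b, x_j>."""
    n, m = len(data), len(data[0])
    b = [1.0 / m] * m                          # start in the interior of S
    for _ in range(steps):
        grad = [0.0] * m
        for x in data:
            r = sum(bi * xi for bi, xi in zip(b, x))
            for i in range(m):
                grad[i] += x[i] / (n * r)      # empirical E[X_i / <b, X>]
        b = [bi * gi for bi, gi in zip(b, grad)]
    return b

data = [(1.8, 1.0), (0.5, 1.0), (1.8, 1.0), (0.5, 1.0)]  # observed returns
b_hat = empirical_log_optimal(data)
# for these data the maximiser of 0.5*log(1+0.8b) + 0.5*log(1-0.5b) is b = 0.375
```

Used on a rolling window of past returns, $\hat b_{n+1}$ is exactly the kind of estimator whose underperformance is quantified in Chapter 2.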
This thesis aims to provide simplified, yet efficient portfolio selection algorithms
if the log-returns follow a Gaussian process (Chapters 2 and 3), a Markov pro-
cess (Chapters 4 and 5) or, more generally, a stationary and ergodic process
(Chapter 6). Our approach is summarized in Figures 1.1 and 1.2.
For the sake of completeness, it should be noted that in recent years, the log-
optimality criterion has been generalized in several ways. In particular, re-
searchers tried

– to make the log-optimality criterion risk sensitive, i.e. to introduce de-


vices which allow the investor to adjust the log-optimal strategy to his
individual risk aversion level. This may be achieved in two different ways:
Either, as in the Markowitz mean-variance model, the investor seeks to
maximize the expected log-return under variance constraints (Ye and Li,
1999), or the log-utility is extended by the variance, e.g. when maximizing
$-(2/\theta) \log E \exp(-(\theta/2) \log S_n) = E \log S_n - (\theta/4)\,\mathrm{Var} \log S_n + O(\theta^2)$,
where $\theta > 0$ is a risk aversion parameter (Bielecki and Pliska, 1999,
and Stettner, 1999, for continuous time models; Bielecki, Hernández and
Pliska, 1999, for a discrete time model).

– to rid the log-optimality criterion of its stochastic setting, making it appli-


cable to markets with doubtful stochastic properties (Cover, 1991; Cover
and Ordentlich, 1996, including side information; Helmbold et al., 1998,
investigating algorithmic issues and Blum and Kalai, 1999, for the trans-
action cost case).

It is left for future research to generalize the results of this thesis to these
extended models.

CHAPTER 2

Portfolio benchmarking: rates and dimensionality
Based on market observations, the investor can follow many different empirical
portfolio selection rules (“empirical” being synonymous with “based on histor-
ical return data”). Not all of these necessarily turn out to be a good choice
in view of the investor’s goal. Discriminating between “good” and “bad” port-
folios requires the investor to compare the performance of his portfolio with
the given investment goal. Naturally, “good” empirical portfolio selection rules
should approach the investment goal. It is of serious interest to determine how
fast the investor approaches his goal as more and more information about the
market is gathered. This is the primary task of what we might call “portfolio
benchmarking”. Portfolio benchmarking analyses how well a given portfo-
lio b or a given portfolio selection rule b = b(X1, ..., Xn) of the past returns
X1 , ..., Xn performs with respect to a fixed benchmark, in our case with re-
spect to the expected logarithmic portfolio return in the next market period,
E log bXn+1 (to keep notation simple we write log bXn+1 rather than log bT Xn+1
or log < b, Xn+1 > in the sequel). This, of course, requires a standardized way
to assess to what extent an empirical portfolio selection rule underperforms in
comparison with the log-optimal rule.
In this chapter we analyse how seriously a log-optimal portfolio selection rule
based on an estimate for the true return distribution may underperform. To this
end, we propose a specific measure of underperformance (cf. (2.1.2)). Establish-
ing a lower bound result on this measure, it will be seen that underperformance
cannot vanish at arbitrarily high rate as the investor gathers more and more
knowledge about the market (Theorem 2.1.1). All investors are subject to a
universally limited rate at which investment rules can succeed in exploiting
historical market data.

In fact, the empirical log-optimal portfolio of Chapter 1 turns out to be a se-


lection rule that achieves the optimal rate (Theorem 2.1.3). It is particularly
striking that this rate does not depend upon the numbers of stocks included
in the portfolio selection process (Theorem 2.1.4). One is tempted to think
that arbitrarily large portfolios can be handled successfully without extra pre-
cautions. Reasons will be given why this is fallacious and does not obviate the
necessity of trying to keep the dimension of the portfolio at a reasonably low level
by pre-selecting a “good” subset of all possible stocks.
However, the pre-selection of stocks is far from straightforward: as we
shall see, there is no way of pre-selecting the stocks on the basis of the perfor-
mance of the single stocks only (Theorem 2.2.1). To find the optimal portfolio
configuration, the investor has to evaluate the log-optimal portfolios of all pos-
sible subsets of stocks and compare the resulting expected logarithmic portfolio
returns, a huge though necessary computational effort in high dimensions.

2.1 Rates of convergence in i.i.d. models


Suppose the m-dimensional stock return vectors X1 , X2, ... constitute a sequence
of independent, identically distributed (i.i.d.) random variables with distribu-
tion Q := PX1 . Q is not disclosed to the investor, who, after n market periods,
may exploit the observations X1 , ..., Xn to obtain a distribution estimate Q̂n of
Q. Let F and F̂n = F̂n (·, X1, ..., Xn) denote the cumulative distribution func-
tions associated with Q and Q̂n, respectively. We shall restrict our analysis to
estimators $\hat{Q}_n$ whose sensitivity to outliers is such that
$$\Big| \hat{F}_n(x, X_1, ..., X_{i-1}, X_i, X_{i+1}, ..., X_n) - \hat{F}_n(x, X_1, ..., X_{i-1}, X_i', X_{i+1}, ..., X_n) \Big| \le \frac{c(x, X_i, X_i')}{n} \qquad (2.1.1)$$
for some function $c: \mathbb{R}_+^{3m} \to \mathbb{R}_+$, whatever $i \in \mathbb{N}$, $x, X_1, ..., X_n, X_1', ..., X_n' \in \mathbb{R}_+^m$
may be. Most of the standard distribution estimates share this property, such
as the empirical distribution
$$\hat{Q}_n(A) = \frac{1}{n} \sum_{i=1}^n 1_A(X_i) \qquad (A \in \mathcal{B}(\mathbb{R}^m))$$

and kernel estimates
$$\hat{Q}_n(A) = \frac{1}{n h_n^m} \sum_{i=1}^n \int_A K\Big( \frac{x - X_i}{h_n} \Big)\, dx \qquad (A \in \mathcal{B}(\mathbb{R}^m)),$$
$h_n$ being a sequence of nonnegative bandwidths and $K: \mathbb{R}^m \to [0, \infty)$ a kernel
function.
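For the empirical distribution one may even take $c \equiv 1$ in (2.1.1), since swapping a single observation moves the empirical cdf by at most $1/n$ everywhere. A small check with invented data (not from the thesis):

```python
# Swapping one observation changes the empirical cdf by at most 1/n at every
# point, i.e. the empirical distribution satisfies (2.1.1) with c = 1.
# The sample and the evaluation grid below are invented example data.

def empirical_cdf(x, sample):
    """F_n(x) = (1/n) * #{i : X_i <= x componentwise}."""
    n = len(sample)
    return sum(all(s[j] <= x[j] for j in range(len(x))) for s in sample) / n

sample = [(1.1, 0.9), (0.8, 1.2), (1.0, 1.0), (1.3, 0.7)]
perturbed = [(0.2, 3.0)] + sample[1:]        # replace the first observation

grid = [(0.5, 0.5), (0.9, 1.0), (1.2, 1.1), (2.0, 2.0)]
worst = max(abs(empirical_cdf(x, sample) - empirical_cdf(x, perturbed))
            for x in grid)
# here worst equals 0.25 = 1/n
```

The bound holds because the swapped observation can change the count of sample points below any threshold x by at most one.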
Having thus learned a “picture” of the market, Q̂n, the investor allocates his
wealth according to the corresponding log-optimal portfolio

$$\hat{b}_{n+1} = \hat{b}_{n+1}(X_1, ..., X_n) = \arg\max_{b \in S} E \log bY,$$

where the expectation is calculated for Y ∼ Q̂n . This choice yields the random
return R̂n = b̂n+1 Xn+1 during the next market period. In order to determine
how well $\hat{b}_{n+1}$ reproduces the true log-optimal portfolio $b^* = \arg\max_{b \in S} E \log bX_1$
with return $R_n^* = b^* X_{n+1}$, we first observe that
$$E \log R_n^* - E \log \hat{R}_n = E \log b^* X_{n+1} - E\big( E[\log \hat{b}_{n+1}(X_1, ..., X_n) X_{n+1} \mid X_1, ..., X_n] \big)$$
$$\ge E \log b^* X_{n+1} - E\big( E[\log b^* X_{n+1} \mid X_1, ..., X_n] \big) = E \log b^* X_{n+1} - E \log b^* X_{n+1} = 0,$$
using the independence of $X_1, ..., X_{n+1}$. Hence
$$\Delta(\hat{Q}_n, Q) := E \log \frac{R_n^*}{\hat{R}_n} \ge 0 \qquad (2.1.2)$$
with equality if Q̂n = Q. On the other hand, from Theorem 1.3.4,
$$\liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^n \log \frac{R_i^*}{\hat{R}_i} \ge 0.$$

Taking expectations and using Fatou’s lemma we obtain


$$0 \le E \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^n \log \frac{R_i^*}{\hat{R}_i} \le \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^n E \log \frac{R_i^*}{\hat{R}_i}.$$

Therefore, {b̂n}n is a good portfolio selection rule if


Rn∗
∆(Q̂n , Q) = E log →0
R̂n

with high rate as n tends to ∞.


∆(Q̂n , Q) measures underperformance of b̂n w.r.t. the benchmark portfolio b∗ .
In the sequel we will derive asymptotic properties of ∆(Q̂n , Q). The following
theorem shows that the limit cannot be achieved at arbitrarily high speed of
convergence:

Theorem 2.1.1. For any sequence of distribution estimates Q̂n satisfying


(2.1.1), there exists a market distribution Q and a market constant c for which
$$\Delta(\hat{Q}_n, Q) \ge \frac{c}{n} \qquad (2.1.3)$$
for infinitely many n.

As will be seen in the proof, (2.1.1) is not needed when considering unbiased
estimators $\hat{Q}_n$, i.e., estimators for which $E\hat{Q}_n(A) = Q(A)$ for all $A \in \mathcal{B}(\mathbb{R}_+^m)$.

Proof. Consider a 2-stock market with return vector $(X^{(1)}, X^{(2)}) \in \mathbb{R}_+^2$ and
portfolios $(b, 1-b)$, $b \in [0, 1]$. We can expand
$$E \log(bX^{(1)} + (1-b)X^{(2)}) = E \log((Z-1)b + 1) + E \log X^{(2)} \qquad (2.1.4)$$
with the return ratio $Z := X^{(1)}/X^{(2)}$. Thus, in a 2-stock market, the log-optimal
portfolio only depends upon the distribution of the return ratio Z. For
simplicity, let Z be of the form
$$Z = \begin{cases} A & \text{with probability } p, \\ B & \text{with probability } 1 - p \end{cases} \qquad (2.1.5)$$
with p ∈ (0, 1), A, B > 0 to be chosen later.
We first consider the classical parameter estimation problem of estimating p,
which will be linked with the portfolio selection problem at a later stage. Q̂n
allows the investor to derive an estimate of p,
$$\hat{p}_n = \hat{p}_n(z^n) := \hat{Q}_n\big( \{ (x^{(1)}, x^{(2)}) : x^{(1)}/x^{(2)} = A \} \big),$$
$z^n \in \{A, B\}^n$ being the observed realisations of the i.i.d. return ratios $Z_1, ..., Z_n$
(independent of Z). If $k(z^n)$ denotes the number of A's in $z^n$ and $B_i^n(p) = \binom{n}{i} p^i (1-p)^{n-i}$ denotes the ith Bernstein polynomial of order n, we can identify
$$f_n(p) := E \hat{p}_n(Z_1, ..., Z_n) = \sum_{i=0}^n \Big( \binom{n}{i}^{-1} \sum_{z^n: k(z^n) = i} \hat{p}_n(z^n) \Big) B_i^n(p) =: \sum_{i=0}^n b_{i,n} B_i^n(p)$$

as a Bézier curve. For reasons to become clear later it is important to study its
derivative
$$f_n'(p) = \sum_{i=0}^{n-1} n (b_{i+1,n} - b_{i,n}) B_i^{n-1}(p).$$
Combinatorial arguments given at the end of this proof and relation (2.1.1)
yield
$$n |b_{i+1,n} - b_{i,n}| \le \text{const.} \qquad (2.1.6)$$
independently of i and n. Using $\sum_{i=0}^{n-1} B_i^{n-1}(p) = 1$ we obtain
$$|f_n'(p)| \le \text{const.} \qquad (2.1.7)$$

for all n and p.


We now choose the true parameter p of the model (2.1.5), which we will denote
by p∗ :

– If $f_n(p) \not\to p$ as $n \to \infty$ for some $p \in (0, 1)$, we fix this p to be the true
parameter $p^*$ of the distribution of Z.

– If $f_n(p) \to p$ as $n \to \infty$ for all $p \in (0, 1)$, we have
$$\int_{p/2}^{p} f_n'(q)\, dq = f_n(p) - f_n(p/2) \to p/2 \qquad (2.1.8)$$
as $n \to \infty$. From this there exists a $p \in (0, 1)$ with $f_n'(p) \not\to 0$ (otherwise
(2.1.7) and the Lebesgue dominated convergence theorem lead to a contradiction
in (2.1.8)). This p is taken to be the true parameter $p^*$ of the
distribution of Z.

The mean squared error $MSE(\hat{p}_n) = E(p^* - \hat{p}_n)_+^2 + E(p^* - \hat{p}_n)_-^2$ satisfies
$$E(p^* - \hat{p}_n)_+^2 \ge \frac{1}{2} MSE(\hat{p}_n) \qquad (2.1.9)$$
or
$$E(p^* - \hat{p}_n)_-^2 \ge \frac{1}{2} MSE(\hat{p}_n) \qquad (2.1.10)$$
for infinitely many n. In either case the Cramér-Rao lower bound yields (for
infinitely many n)
$$E(p^* - \hat{p}_n)_\pm^2 \ge \frac{1}{2} \Big( \frac{f_n'(p^*)^2}{I_n(p^*)} + (f_n(p^*) - p^*)^2 \Big)$$
with Fisher information $I_n(p^*) = np^*(1 - p^*)$ (Lehmann, 1983, Theorem 6.4).


By the choice of $p^*$, the proof is finished if we can adjust A, B such that
$$\Delta(\hat{Q}_n, Q) \ge \text{const.} \cdot E(p^* - \hat{p}_n)_+^2$$
if (2.1.9) applies or
$$\Delta(\hat{Q}_n, Q) \ge \text{const.} \cdot E(p^* - \hat{p}_n)_-^2$$
if (2.1.10) applies. This is done in the following.


If Z is distributed according to the general form (2.1.5), simple calculations
yield that (2.1.4) is maximized by
$$b = b(p) = T\Big( -\frac{p(A - B) + B - 1}{(A - 1)(B - 1)} \Big)$$
where
$$T(x) = \begin{cases} 1 & \text{if } x > 1, \\ x & \text{if } 0 \le x \le 1, \\ 0 & \text{if } x < 0. \end{cases}$$

Suppose (2.1.9) holds for infinitely many n. Set $A := p^*$ and $B := 1 + p^*$. Then
$b(p^*) = 0$ and $R_n^* = X^{(2)}$. In this case
$$E\Big[ \log \frac{R_n^*}{\hat{R}_n} \,\Big|\, X_1, ..., X_n \Big] = -E[\log((Z-1)b(\hat{p}_n) + 1) \mid X_1, ..., X_n]$$
$$= -\big( p^* \log((p^* - 1)b(\hat{p}_n) + 1) + (1 - p^*) \log(p^* b(\hat{p}_n) + 1) \big).$$
More precisely,
$$b(\hat{p}_n) = \begin{cases} 0 & \text{if } \hat{p}_n > p^*, \\ \frac{p^* - \hat{p}_n}{p^*(1 - p^*)} & \text{if } p^{*2} \le \hat{p}_n \le p^*, \\ 1 & \text{if } \hat{p}_n < p^{*2} \end{cases}$$
and
$$E\Big[ \log \frac{R_n^*}{\hat{R}_n} \,\Big|\, X_1, ..., X_n \Big] = \begin{cases} 0 & \text{if } \hat{p}_n > p^*, \\ D(p^* \| \hat{p}_n) & \text{if } p^{*2} \le \hat{p}_n \le p^*, \\ D(p^* \| p^{*2}) & \text{if } \hat{p}_n < p^{*2} \end{cases}$$
with the Kullback-Leibler distances (relative entropies)
$$D(p^* \| \hat{p}_n) = p^* \log \frac{p^*}{\hat{p}_n} + (1 - p^*) \log \frac{1 - p^*}{1 - \hat{p}_n},$$
$$D(p^* \| p^{*2}) = -\big( p^* \log p^* + (1 - p^*) \log(1 + p^*) \big).$$

The $L_1$-bound on the Kullback-Leibler distance (Cover and Thomas, 1991,
Lemma 12.6.1, eq. 12.139) yields
$$D(p^* \| \hat{p}_n) \ge \frac{2}{\log 2} (p^* - \hat{p}_n)^2$$
so that
$$\Delta(\hat{Q}_n, Q) = E\Big( E\Big[ \log \frac{R_n^*}{\hat{R}_n} \,\Big|\, X_1, ..., X_n \Big] \Big) = E\big( D(p^* \| \hat{p}_n)\, 1_{[p^{*2}, p^*]}(\hat{p}_n) + D(p^* \| p^{*2})\, 1_{[0, p^{*2}]}(\hat{p}_n) \big)$$
$$\ge E\Big( \frac{2}{\log 2} (p^* - \hat{p}_n)^2\, 1_{[p^{*2}, p^*]}(\hat{p}_n) + D(p^* \| p^{*2})\, 1_{[0, p^{*2}]}(\hat{p}_n) \Big)$$
$$\ge \min\Big\{ \frac{2}{\log 2}, \frac{D(p^* \| p^{*2})}{p^{*2}} \Big\} E(p^* - \hat{p}_n)_+^2.$$

This is what we wanted to show in case (2.1.9) holds. If (2.1.10) applies, we set
$A := 2 - p^*$ and $B := 1 - p^*$ and argue similarly.
It remains to prove (2.1.6). For our specific model we can assume
$$X = \begin{cases} (A, 1) & \text{with probability } p, \\ (1, 1/B) & \text{with probability } 1 - p. \end{cases}$$

If the observed return ratios $z^n$ and $z'^n \in \{A, B\}^n$ differ in one digit only, so do
the sequences of realisations $x_1, ..., x_n$ and $x_1', ..., x_n'$ of X that generate $z^n$ and
$z'^n$, respectively. Hence, using (2.1.1),
$$|\hat{p}_n(z^n) - \hat{p}_n(z'^n)| = \lim_{\varepsilon \to 0+} \Big| \big( \hat{F}_n((A+\varepsilon, 1+\varepsilon), x_1, ..., x_n) - \hat{F}_n((A, 1), x_1, ..., x_n) \big)$$
$$- \big( \hat{F}_n((A+\varepsilon, 1+\varepsilon), x_1', ..., x_n') - \hat{F}_n((A, 1), x_1', ..., x_n') \big) \Big|$$
$$\le \lim_{\varepsilon \to 0+} \big| \hat{F}_n((A+\varepsilon, 1+\varepsilon), x_1, ..., x_n) - \hat{F}_n((A+\varepsilon, 1+\varepsilon), x_1', ..., x_n') \big| + \big| \hat{F}_n((A, 1), x_1, ..., x_n) - \hat{F}_n((A, 1), x_1', ..., x_n') \big|$$
$$\le \max_{(x,y) \in \{(A,1), (1,1/B)\}} \Big( \frac{\lim_{\varepsilon \to 0+} c((A+\varepsilon, 1+\varepsilon), x, y)}{n} + \frac{c((A,1), x, y)}{n} \Big) =: \frac{c}{n}.$$

Let $\mathcal{F}(z^n)$ consist of all elements of $\{A, B\}^n$ which can be generated by changing
exactly one of the digits B in $z^n$ to A, and let $\mathcal{G}(z^n)$ consist of all elements of
$\{A, B\}^n$ which can be generated by changing exactly one of the A's in $z^n$ to B.
Clearly $|\mathcal{F}(z^n)| = n - k(z^n)$ and $|\mathcal{G}(z^n)| = k(z^n)$. From this
$$b_{i,n} = \binom{n}{i}^{-1} \sum_{z^n: k(z^n) = i} \hat{p}_n(z^n) = \binom{n}{i}^{-1} (n-i)^{-1} \sum_{z^n: k(z^n) = i} (n-i)\, \hat{p}_n(z^n)$$
$$= \binom{n}{i}^{-1} (n-i)^{-1} \sum_{z^n: k(z^n) = i} \sum_{z'^n \in \mathcal{F}(z^n)} \hat{p}_n(z^n)$$
$$= \binom{n}{i+1}^{-1} (i+1)^{-1} \sum_{z'^n: k(z'^n) = i+1} \sum_{z^n \in \mathcal{G}(z'^n)} \hat{p}_n(z'^n) + \binom{n}{i+1}^{-1} (i+1)^{-1} \sum_{z'^n: k(z'^n) = i+1} \sum_{z^n \in \mathcal{G}(z'^n)} \big( \hat{p}_n(z^n) - \hat{p}_n(z'^n) \big)$$
$$= \binom{n}{i+1}^{-1} \sum_{z'^n: k(z'^n) = i+1} \hat{p}_n(z'^n) \cdot \Big( \frac{1}{i+1} \sum_{z^n \in \mathcal{G}(z'^n)} 1 \Big) + \binom{n}{i+1}^{-1} (i+1)^{-1} \sum_{z'^n: k(z'^n) = i+1} \sum_{z^n \in \mathcal{G}(z'^n)} \big( \hat{p}_n(z^n) - \hat{p}_n(z'^n) \big)$$
$$= b_{i+1,n} + \Big\{ \binom{n}{i+1}^{-1} (i+1)^{-1} \sum_{z'^n: k(z'^n) = i+1} \sum_{z^n \in \mathcal{G}(z'^n)} \big( \hat{p}_n(z^n) - \hat{p}_n(z'^n) \big) \Big\}.$$
The latter bracket $\{...\}$ is an average with constituents bounded from above in
absolute value by c/n. Hence $|b_{i+1,n} - b_{i,n}| \le c/n$ and the proof is finished. □
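The lower bound can be watched at work numerically. The sketch below (with invented parameters A, B, $p^*$; not part of the thesis) evaluates $\Delta(\hat Q_n, Q)$ exactly for the relative-frequency estimate $\hat p_n = k/n$ in the two-point model (2.1.5) and exhibits the $O(1/n)$ decay:

```python
# Exact evaluation of Delta(Q_hat_n, Q) in the two-point return-ratio model
# (2.1.5) when p is estimated by the relative frequency p_hat = k/n.
# The parameter values are invented example data.
from math import comb, log

A, B, p_star = 0.5, 1.5, 0.5        # invented model; here b(p*) = 0

def b_opt(p):
    """Clamped formula for the log-optimal weight on stock 1 (cf. the proof)."""
    v = -(p * (A - B) + B - 1) / ((A - 1) * (B - 1))
    return min(1.0, max(0.0, v))

def growth(b, p):
    """E log((Z - 1) b + 1) for Z = A w.p. p and Z = B w.p. 1 - p."""
    return p * log((A - 1) * b + 1) + (1 - p) * log((B - 1) * b + 1)

def delta(n):
    """Delta = E[growth(b(p*), p*) - growth(b(p_hat), p*)], p_hat = k/n."""
    best = growth(b_opt(p_star), p_star)
    return sum(comb(n, k) * p_star**k * (1 - p_star)**(n - k)
               * (best - growth(b_opt(k / n), p_star)) for k in range(n + 1))

deltas = {n: delta(n) for n in (10, 50, 200)}
```

In this configuration $n \cdot \Delta(\hat Q_n, Q)$ stays bounded, in line with the $c/n$ behaviour established above.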

Remark. If the analysis is restricted to unbiased estimators in the sense that
$E\hat{Q}_n(A) = Q(A)$ for all $A \in \mathcal{B}(\mathbb{R}_+^m)$, in particular $f_n(p) = p$, then we can
choose $p^* = 1/2$ to obtain
$$E \log \frac{R_n^*}{\hat{R}_n} \ge \frac{2}{\log 2}\, E\Big( \frac{1}{2} - \hat{p}_n \Big)_+^2 \ge \frac{2}{\log 2} \cdot \frac{1}{2} \cdot \frac{1}{I_n(1/2)} = \frac{4}{n \log 2}$$
without having to impose (2.1.1).


It is interesting to note that we can bound ∆(Q̂n , Q) in terms of the Kullback-
Leibler distance between Q̂n and Q not only from below but also from above.
This was obtained by Barron and Cover (1988, Theorem 1, see also Cover and
Thomas, 1991, Theorem 15.4.1):

Theorem 2.1.2. (Cover and Thomas, 1991) Let Q be the true return distribu-
tion and Q̂n a sequence of distribution estimates, both having densities q and
q̂n , respectively, w.r.t. some common dominating measure. Then

$$\Delta(\hat{Q}_n, Q) \le E D(Q \| \hat{Q}_n)$$
with the Kullback-Leibler distance
$$D(Q \| \hat{Q}_n) = \int \log \frac{q(x)}{\hat{q}_n(x)}\, Q(dx).$$

Remark. As a consequence of Theorem 2.1.2, consistent distribution estimates,


i.e., estimates for which ED(Q||Q̂n ) → 0, generate consistent portfolio selection
rules. The results of Györfi et al. (1994), however, show that convergence in
Kullback-Leibler distance is a considerably strong requirement. There is no
distribution estimate, for example, that is Kullback-Leibler consistent for any
return distribution on a non-finite countable set. The results of Algoet and
Cover (1988) demonstrate that almost sure weak convergence of Q̂n to Q suffices
to obtain consistent portfolio selection rules.

Proof. It suffices to show
$$E\Big[ \log \frac{R_n^*}{\hat{R}_n} \,\Big|\, X_1, ..., X_n \Big] \le D(Q \| \hat{Q}_n), \qquad (2.1.11)$$
the assertion follows by taking expectations. If $D(Q \| \hat{Q}_n) = \infty$ there is nothing to
prove. So, assume $D(Q \| \hat{Q}_n) < \infty$, which implies absolute continuity of Q w.r.t.
$\hat{Q}_n$, i.e., $Q \ll \hat{Q}_n$. Set $A := \{x : \hat{R}_n = \hat{b}_n x > 0,\ q(x) > 0,\ \hat{q}_n(x) > 0\}$. From the
Kuhn-Tucker conditions (Theorem 1.3.3) it is clear that $\hat{Q}_n(\{x : \hat{R}_n > 0\}) = 1$
and, by $Q \ll \hat{Q}_n$, also $Q(\{x : \hat{R}_n > 0\}) = 1$. From this and using $Q \ll \hat{Q}_n$
again, $Q(A) = 1$. Thus

$$E\Big[ \log \frac{R_n^*}{\hat{R}_n} \,\Big|\, X_1, ..., X_n \Big] = \int_A \log \frac{b^* x}{\hat{b}_n x}\, Q(dx) = \int_A \log\Big( \frac{b^* x}{\hat{b}_n x} \cdot \frac{\hat{q}_n(x)}{q(x)} \cdot \frac{q(x)}{\hat{q}_n(x)} \Big)\, Q(dx)$$
$$= \int_A \log\Big( \frac{b^* x}{\hat{b}_n x} \cdot \frac{\hat{q}_n(x)}{q(x)} \Big)\, Q(dx) + D(Q \| \hat{Q}_n)$$
$$\le \log \int_A \frac{b^* x}{\hat{b}_n x}\, \hat{Q}_n(dx) + D(Q \| \hat{Q}_n) \le D(Q \| \hat{Q}_n),$$
the first inequality by Jensen's inequality, the latter by the Kuhn-Tucker conditions again. □


From this theorem we can easily infer that the lower bound on ∆(Q̂n, Q) given
in (2.1.3) is sharp, and that there are estimators attaining the optimal rate of
decay, O(1/n). In particular, the log-optimal portfolio based on the empirical
distribution does so.

Theorem 2.1.3. Assume the return distribution Q is supported on a finite set
$\mathcal{M}$. Let $\hat{Q}_n$ be the empirical distribution on $\mathcal{M}$ after n observations and
$$\hat{b}_{n+1} := \arg\max_{b \in S} \frac{1}{n} \sum_{i=1}^n \log bX_i$$
the associated log-optimal portfolio, then
$$\Delta(\hat{Q}_n, Q) \le \frac{|\mathcal{M}|}{n}. \qquad (2.1.12)$$

Proof. Let $\Gamma_{n,\varepsilon} := \{\mu \mid \mu \text{ empirical distribution of } n \text{ data points on } \mathcal{M},\ D(Q \| \mu) > \varepsilon\}$.
Then according to Sanov's theorem (Cover and Thomas, 1991, Ch. 12)
$$Q\big( D(Q \| \hat{Q}_n) > \varepsilon \big) = Q\big( \hat{Q}_n \in \Gamma_{n,\varepsilon} \big) \le |\mathcal{M}| \exp\Big( -n \min_{\mu \in \Gamma_{n,\varepsilon}} D(Q \| \mu) \Big) \le |\mathcal{M}| \exp(-n\varepsilon).$$
From this we calculate
$$E D(Q \| \hat{Q}_n) = \int_0^\infty Q\big( D(Q \| \hat{Q}_n) > \varepsilon \big)\, d\varepsilon \le \int_0^\infty |\mathcal{M}| \exp(-n\varepsilon)\, d\varepsilon = \frac{|\mathcal{M}|}{n}.$$
Application of Theorem 2.1.2 proves the theorem. □

It is an interesting feature of inequality (2.1.12) that the rate itself does not
deteriorate when the number m of stocks in the market model grows. This is
rather untypical of nonparametric estimation problems. Of course, the number


of stocks in the market influences the constant in ∆(Q̂n, Q) = O(1/n): We have
proven that for the empirical log-optimal strategy
Ri∗ c(i)
E log ∼
R̂i i

with c(i) = O(1). To see how c(i) depends on m we allow ourselves some
heuristics: Móri (1982) proved that for a (slight) modification of the empiri-
cal log-optimal strategy, n(m−1)/2 Ŝn /Sn∗ converges to a non-degenerate random
variable Z,
Ŝn
n(m−1)/2 ∗ → Z in distribution.
Sn
This can be rewritten as

\[ \sum_{i=1}^n \Bigl(\frac{m-1}{2}\cdot\frac{1}{i} - \log\frac{R_i^*}{\hat R_i}\Bigr) \to \log Z \quad \text{in distribution.} \]

Taking expectations, the left-hand side becomes

\[ \sum_{i=1}^n \frac{m-1}{2}\cdot\frac{1}{i} - \sum_{i=1}^n \mathrm{E}\log\frac{R_i^*}{\hat R_i} \sim \sum_{i=1}^n \frac{m-1}{2}\cdot\frac{1}{i} - \sum_{i=1}^n \frac{c(i)}{i}, \]

both sums being of logarithmic growth. To obtain convergence we infer \(c(i) \sim \frac{m-1}{2}\), the constant possibly growing linearly with the number of stocks.

Up to a logarithmic factor, the phenomenon of the rate being insensitive to


m carries over to more sophisticated settings where the return vector is not
necessarily restricted to finitely many outcomes. In particular, we have the
following theorem for the empirical log-optimal portfolio.

Theorem 2.1.4. Assume the return distribution Q is concentrated on a cube [A, B]^m with 0 < A ≤ B < ∞. Let

\[ \hat b_{n+1} := \arg\max_{b \in S} \frac{1}{n}\sum_{i=1}^n \log b X_i \]

be the empirical log-optimal portfolio. Then

\[ \Delta(\hat Q_n, Q) = o\Bigl(\frac{\log^q n}{n^{1/2}}\Bigr) \]

for any q > max{(m − 1)/2, 1}.

Up to the logarithmic factor, the rate coincides with the classical rate n−1/2 of
stochastic parameter estimators – regardless of the portfolio dimension m.
Proof. For the proof we can assume m > 2: if m = 2 we artificially produce a market with 3 assets from the original 2 stocks and a bond returning A/2 in each market period. In this setting, we never invest in the bond, i.e., log-optimal investment is the same as in the original 2-stock market. A rate result for the 3-stock market carries over to the 2-stock market.

First we make some preliminary observations on the covering of the simplex

\[ S := \Bigl\{ b \in \mathbb{R}^m \;\Big|\; b_i \ge 0,\ \sum_{i=1}^m b_i = 1 \Bigr\}. \]

Let \(S'_m := \{ b \in \mathbb{R}^m \mid b_i \ge 0,\ \sum_{i=1}^m b_i \le 1 \}\) and define the mapping \(F : S'_{m-1} \to S,\ (x_1, \dots, x_{m-1}) \mapsto (x_1, \dots, x_{m-1}, 1 - \sum_{i=1}^{m-1} x_i)\). Fix some ε > 0. Clearly, we can cover \(S'_{m-1} \subseteq [0,1]^{m-1}\) with \(N \le \lceil 1/\delta \rceil^{m-1}\) \(\|\cdot\|_\infty\)-balls of radius δ := ε/(m − 1) centered at \(c^{(1)}, \dots, c^{(N)} \in S'_{m-1}\). For any x ∈ S,

\[ \inf_{i=1,\dots,N} \bigl\|(x_1,\dots,x_m) - F(c^{(i)})\bigr\|_\infty = \inf_{i=1,\dots,N} \max\Bigl\{ \bigl\|(x_1,\dots,x_{m-1}) - c^{(i)}\bigr\|_\infty,\ \Bigl|\sum_{j=1}^{m-1}\bigl(c_j^{(i)} - x_j\bigr)\Bigr| \Bigr\} \le \inf_{i=1,\dots,N} (m-1)\bigl\|(x_1,\dots,x_{m-1}) - c^{(i)}\bigr\|_\infty \le \varepsilon. \]

It follows that S can be covered by at most \(\lceil (m-1)/\varepsilon \rceil^{m-1}\) \(\|\cdot\|_\infty\)-balls of radius ε.
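The covering argument can be checked numerically. The sketch below is an illustration only; it simplifies the proof slightly by drawing the centres from the whole cube [0,1]^{m−1} rather than from S'_{m−1}, which does not weaken the conclusion.

```python
import numpy as np
from itertools import product

def simplex_cover_check(m=3, eps=0.3, n_samples=2000, seed=1):
    """Check that an eps/(m-1)-grid of [0,1]^(m-1), lifted to the simplex
    via F(c) = (c, 1 - sum(c)), is an eps-net of S in the sup-norm."""
    delta = eps / (m - 1)
    K = int(np.ceil(1.0 / delta))            # balls per axis: ceil(1/delta)
    axis = (np.arange(K) + 0.5) / K          # centres with spacing 1/K <= delta
    centers = np.array(list(product(axis, repeat=m - 1)))
    lifted = np.hstack([centers, 1 - centers.sum(axis=1, keepdims=True)])  # F(c)
    rng = np.random.default_rng(seed)
    Xs = rng.dirichlet(np.ones(m), size=n_samples)   # random points on S
    # sup-norm distance from each sampled x to its nearest lifted centre
    d = np.abs(Xs[:, None, :] - lifted[None, :, :]).max(axis=2).min(axis=1)
    return len(centers), float(d.max())

N, worst = simplex_cover_check()
```

For m = 3 and ε = 0.3 this produces N = 49 = ⌈(m−1)/ε⌉^{m−1} centres, and every sampled point of S lies within ε of a lifted centre, as the covering bound asserts.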
Let X_1, ..., X_n denote independent return data and augment this family by a random variable X independent of X_1, ..., X_n with the same distribution as X_1. Introduce the following abbreviations:

\[ \Phi_n := \max_{b \in S} \frac{1}{n}\sum_{i=1}^n \log b X_i = \frac{1}{n}\sum_{i=1}^n \log \hat b_n X_i, \qquad L_n := \mathrm{E}[\log \hat b_n X \mid X_1, \dots, X_n] \]

and

\[ L^* := \max_{b \in S} \mathrm{E}\log b X = \mathrm{E}\log b^* X. \]

Clearly,

\[ L_n \le \max_{b \in S} \mathrm{E}[\log b X \mid X_1, \dots, X_n] = \max_{b \in S} \mathrm{E}\log b X = L^*. \]

To bound the tail probability of L* − L_n ≥ 0 we use the following decomposition:

\[ \mathrm{P}(L^* - L_n > \varepsilon) = \mathrm{P}(L^* - \Phi_n + \Phi_n - L_n > \varepsilon) \le \mathrm{P}(L^* - \Phi_n \ge \varepsilon - \delta) + \mathrm{P}(\Phi_n - L_n \ge \delta) =: K_1 + K_2, \]

where δ > 0 is chosen later.

Bounding K_1: We start with

\[ L^* - \Phi_n = \max_b \mathrm{E}\log b X - \max_b \frac{1}{n}\sum_{i=1}^n \log b X_i \le \max_b \Bigl( \mathrm{E}\log b X - \frac{1}{n}\sum_{i=1}^n \log b X_i \Bigr). \]

For τ > 0 to be chosen later, cover S by \(N \le \lceil (m-1)/\tau \rceil^{m-1}\) \(\|\cdot\|_\infty\)-balls of radius τ centered at \(b^{(1)}, \dots, b^{(N)}\). Choosing c_1 such that \(|\log b X - \log \tilde b X| \le c_1 \|b - \tilde b\|_\infty\) for all X ∈ [A, B]^m, we obtain

\[ \max_b \Bigl( \mathrm{E}\log b X - \frac{1}{n}\sum_{i=1}^n \log b X_i \Bigr) \le \max_{j=1,\dots,N} \Bigl( \mathrm{E}\log b^{(j)} X - \frac{1}{n}\sum_{i=1}^n \log b^{(j)} X_i \Bigr) + 2 c_1 \tau. \]

Under the condition ε − δ − 2c_1τ > 0, the Hoeffding inequality (Petrov, 1995, 2.6.2; note that \(|\log b X| \le \max\{|\log A|, |\log B|\} =: c_2 < \infty\)) yields

\[ K_1 \le \sum_{j=1}^N \mathrm{P}\Bigl( \mathrm{E}\log b^{(j)} X - \frac{1}{n}\sum_{i=1}^n \log b^{(j)} X_i \ge \varepsilon - \delta - 2c_1\tau \Bigr) \le 2 N \exp\Bigl( -\frac{n(\varepsilon - \delta - 2c_1\tau)^2}{2 c_2^2} \Bigr) \le 2 \Bigl\lceil \frac{m-1}{\tau} \Bigr\rceil^{m-1} \exp\Bigl( -\frac{n(\varepsilon - \delta - 2c_1\tau)^2}{2 c_2^2} \Bigr). \tag{2.1.13} \]

Bounding K_2:

\[ \Phi_n - L_n = \frac{1}{n}\sum_{i=1}^n \log \hat b_n X_i - \mathrm{E}[\log \hat b_n X \mid X_1, \dots, X_n] \le \max_{j=1,\dots,N} \Bigl( \frac{1}{n}\sum_{i=1}^n \log b^{(j)} X_i - \mathrm{E}\log b^{(j)} X \Bigr) + 2 c_1 \tau, \]

and under the condition δ − 2c_1τ > 0 we apply Hoeffding's inequality again to obtain

\[ K_2 \le N \exp\Bigl( -\frac{n(\delta - 2c_1\tau)^2}{2 c_2^2} \Bigr) \le \Bigl\lceil \frac{m-1}{\tau} \Bigr\rceil^{m-1} \exp\Bigl( -\frac{n(\delta - 2c_1\tau)^2}{2 c_2^2} \Bigr). \tag{2.1.14} \]

Combining (2.1.13) and (2.1.14) for δ := ε/2, τ := ε/(8c_1) yields

\[ \mathrm{P}(L^* - L_n > \varepsilon) \le 3 \Bigl\lceil \frac{8 c_1 (m-1)}{\varepsilon} \Bigr\rceil^{m-1} \exp\Bigl( -\frac{n \varepsilon^2}{32 c_2^2} \Bigr). \]
This in turn implies

\[ \Delta(\hat Q_n, Q) = \mathrm{E}(L^* - L_n) = \int_0^\infty \mathrm{P}(L^* - L_n > \varepsilon)\, d\varepsilon \le a_n + \int_{a_n}^\infty 3 \Bigl\lceil \frac{8 c_1 (m-1)}{\varepsilon} \Bigr\rceil^{m-1} \exp\Bigl( -\frac{n \varepsilon^2}{32 c_2^2} \Bigr) d\varepsilon \]
\[ = a_n \Biggl( 1 + 24 c_1 (m-1) \int_{(8 c_1 (m-1))^{-1}}^\infty \Bigl\lceil \frac{1}{a_n z} \Bigr\rceil^{m-1} \exp\bigl( -c_3 n a_n^2 z^2 \bigr)\, dz \Biggr) \]

with \(c_3 := 2 c_1^2 (m-1)^2 / c_2^2\) (substituting ε = 8c_1(m − 1)a_n z) and a_n > 0 still to be adjusted.

We now have to bound integrals of the latter type: for a > 0, m > 2 and \(a_n := c\, n^{-1/2} \log^q n\) (arbitrary c > 0, q > (m − 1)/2) we find that

\[ \int_a^\infty \Bigl\lceil \frac{1}{a_n z} \Bigr\rceil^{m-1} \exp(-c_3 n a_n^2 z^2)\, dz \le 2^{m-2} \int_a^\infty \Bigl( \frac{1}{(a_n z)^{m-1}} + 1 \Bigr) \exp(-c_3 n a_n^2 z^2)\, dz, \]

where we have used the inequality ⌈x⌉ ≤ x + 1 and Jensen's inequality to obtain \((x+1)^{m-1} = 2^{m-1}((x+1)/2)^{m-1} \le 2^{m-1}(x^{m-1}+1)/2 = 2^{m-2}(x^{m-1}+1)\). Bounding the last integral (note that \(\exp(-c_3 n a_n^2 z^2) \le \exp(-a^2 c_3 n a_n^2)\) for z ≥ a) yields

\[ \int_a^\infty \Bigl\lceil \frac{1}{a_n z} \Bigr\rceil^{m-1} \exp(-c_3 n a_n^2 z^2)\, dz \le \frac{2^{m-2} \exp(-a^2 c_3 n a_n^2)}{a_n^{m-1}} \int_a^\infty \frac{dz}{z^{m-1}} + 2^{m-2} \int_0^\infty \exp(-c_3 n a_n^2 z^2)\, dz \]
\[ = \frac{2^{m-2} \exp(-a^2 c_3 n a_n^2)}{(m-2)\, a^{m-2}\, a_n^{m-1}} + 2^{m-3} \Bigl( \frac{\pi}{c_3} \Bigr)^{1/2} \frac{1}{n^{1/2} a_n}. \]

For n sufficiently large,

\[ \exp(-a^2 c_3 n a_n^2) = \exp(-a^2 c_3 c^2 \log^{2q} n) \le \exp\Bigl( -\frac{m-1}{2} \log n \Bigr) = \Bigl( \frac{1}{n} \Bigr)^{(m-1)/2}, \]

and thus

\[ \frac{\exp(-a^2 c_3 n a_n^2)}{a_n^{m-1}} \le \Bigl( \frac{1}{n^{1/2} a_n} \Bigr)^{m-1} = \Bigl( \frac{1}{c \log^q n} \Bigr)^{m-1} \le \frac{1}{c \log^q n}. \]

By the choice of a_n, \(1/(n^{1/2} a_n) = 1/(c \log^q n)\), and we end up with

\[ \Delta(\hat Q_n, Q) \le a_n \Bigl( 1 + \frac{\mathrm{const.}}{c \log^q n} \Bigr) = \frac{c \log^q n}{n^{1/2}} + \frac{\mathrm{const.}}{n^{1/2}} \]

for all n greater than some integer that depends on m, a, c_3 and c. Hence

\[ \limsup_{n\to\infty} \frac{n^{1/2}}{\log^q n}\, \Delta(\hat Q_n, Q) \le c, \]

and from c > 0 being arbitrary we infer

\[ \limsup_{n\to\infty} \frac{n^{1/2}}{\log^q n}\, \Delta(\hat Q_n, Q) = 0, \]

the assertion for the case m > 2. □

2.2 Dimensionality in portfolio selection


Let the investor operate in a market of M stocks with random one day re-
turns X (i) (i = 1, ..., M ). Typically, M is large, e.g., M = 30 for the DAX
(Frankfurt) or Dow Jones IA (New York) stocks, M = 100 for the FTSE100
(London). Common wisdom tells us “don’t put all your eggs in one basket”, the
economist’s version of this saying (as Samuelson, 1967, put it) goes “diversifi-
cation pays”. One is tempted to think the more diversified the portfolio, i.e.,
the more stocks we include in via log-optimal portfolio selection, the better.
The results of the last section, where we have seen that the number of stocks
does not affect the rate at which empirical log-optimal portfolios approach the
optimal performance, make us particularly optimistic. However, we should not
forget that there are also several reasons to avoid selection from a huge set of
stocks:

1. M very much affects the scale of finite sample underperformance via the constants in the rate results (recall what we inferred from Móri, 1982).

2. Standard optimisation methods are computationally demanding in high


dimension.

3. If the log-optimal portfolio is calculated with, e.g., Cover’s algorithm


(Cover, 1984), then at each iteration step an M -dimensional integration
has to be carried out, which requires considerable computational effort.
Also, Cover’s algorithm requires exact knowledge of the M -dimensional
return distribution. In practice, such information must be gathered by
statistical distribution estimation which faces substantial difficulties for
high dimension M (curse of dimensionality, see e.g. Scott, 1992, Chapter
7).

For these reasons, the investor should work with a medium size range of stocks
at a time only. In other words, he will have to pre-select m < M stocks from
the whole market. These pre-selected stocks are the assets he includes in a
log-optimal portfolio. For illustrative purposes, we restrict ourselves to M = 3.
In this case, the investor may compose a log-optimal portfolio out of 6 possible
combinations of 1 or 2 stocks.

\[ V_{\{n\}} := \mathrm{E}\log X^{(n)} \]

is the maximal expected log-return when the portfolio is composed of stock n only. If two different stocks n and m are considered, the maximal expected log-return is

\[ V^*_{\{n,m\}} := \max_{0 \le b \le 1} \mathrm{E}\log\bigl( (1-b) X^{(n)} + b X^{(m)} \bigr). \]

A natural (and in fact a frequently used) way for pre-selection is to start with a first "draught-horse" stock (say stock A) for our portfolio, i.e., a stock such that V_{{A}} is large. From the two remaining contenders (say stocks B and C) the investor then includes the one with good single performance, e.g. B if V_{{B}} > V_{{C}}. The hope is to attain the optimum \(V^*_{\{A,B\}} = \max_{\{n,m\} \subseteq \{A,B,C\}} V^*_{\{n,m\}}\). The following result gives conditions under which this method is doomed to failure in the realistic market model of log-normally distributed returns. More precisely, markets with log-normal returns are characterised for which V_{{1}} < V_{{2}} < V_{{3}} and V*_{{2,3}} < V*_{{1,2}} hold at the same time, the two best single stocks

forming a poorer portfolio than the two worst stocks in the market. As a consequence, in order to select the optimal 2-stock combination from the market, the investor has to evaluate all \(\binom{M}{m}\) possible choices, a huge computational effort in high dimensions – an effort, though, that cannot be avoided. This is in contrast to the Markowitz mean-variance approach, where a portfolio built of stocks 1 and 2 may be superior to a portfolio built of stocks 2 and 3 in terms of risk (i.e., variance of portfolio return), but never in terms of performance (i.e., expected portfolio return).

Theorem 2.2.1. Consider a given variance-covariance matrix

\[ \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{pmatrix}. \]

Then the condition

\[ \sigma_3^2 - 2\sigma_{23} < \sigma_1^2 - 2\sigma_{12} \tag{2.2.1} \]

is necessary and sufficient for a three-stock log-normal market

\[ X^{(i)} := \exp(Y^{(i)}) \quad (i = 1, 2, 3) \quad\text{with}\quad \bigl(Y^{(1)}, Y^{(2)}, Y^{(3)}\bigr) \sim N\bigl((\mu_1, \mu_2, \mu_3);\ \Sigma\bigr) \]

to exist such that

\[ \mu_1 < \mu_2 := 0 < \mu_3 \quad\text{and}\quad V^*_{\{2,3\}} < V^*_{\{1,2\}} \]

simultaneously.

The assertion of Theorem 2.2.1 remains valid if a common µ ∈ IR is added to


µ1 , µ2 and µ3 .
The theorem implies that single stock performance is of secondary importance
in comparison with harmonious teamwork of the stocks. The deeper reason for
this is the effect of "volatility pumping" (Luenberger, 1998, Examples 15.2 and 15.3): the specific volatility structure (i.e., covariance structure) in the market may "pump" growth from one stock to others in the portfolio. In our example, if the covariance σ₁₂ of stock 1 and stock 2 is sufficiently less than the covariance σ₂₃ of stock 2 and stock 3 – preferably sufficiently negative, such that whenever 1 plunges, 2 is likely to increase – then more substantial growth can be achieved by balancing stocks 1 and 2 rather than stocks 2 and 3, even though stock 1 might have poorer single performance than stock 3.
Corresponding results for dimension reduction in pattern recognition have been obtained by Toussaint (1971; also discussed in Devroye, Györfi and Lugosi, 1996, Theorem 32.2).
For the proof of the theorem, we need a number of preliminary observations. First consider a 2-stock market with log-normally distributed returns X^{(i)},

\[ X^{(i)} := \exp(Y^{(i)}) \quad (i = 1, 2) \]

and

\[ \begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix};\ \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right). \]

Log-optimal investment in log-normal markets has been considered, e.g., by Ohlson (1972). As noted in Chapter 1, the log-optimal portfolio

\[ \bigl(b^{(1)*}, b^{(2)*}\bigr) := \arg\max_{b^{(1)},\, b^{(2)} \ge 0,\; b^{(1)} + b^{(2)} = 1} \mathrm{E}\log\bigl( b^{(1)} X^{(1)} + b^{(2)} X^{(2)} \bigr) \]

satisfies the necessary and sufficient Kuhn-Tucker conditions

\[ \mathrm{E}\,\frac{X^{(i)}}{b^{(1)*} X^{(1)} + b^{(2)*} X^{(2)}} \;\begin{cases} = 1 & \text{if } b^{(i)*} > 0, \\ \le 1 & \text{if } b^{(i)*} = 0 \end{cases} \]

(Theorem 1.3.3). In other words,

\[ (1, 0) \text{ log-optimal} \iff \mathrm{E}\,\frac{X^{(2)}}{X^{(1)}} \le 1, \tag{2.2.2} \]
\[ (0, 1) \text{ log-optimal} \iff \mathrm{E}\,\frac{X^{(1)}}{X^{(2)}} \le 1, \tag{2.2.3} \]
\[ (b^{(1)}, b^{(2)}) \text{ log-optimal} \iff \mathrm{E}\,\frac{X^{(1)}}{b^{(1)} X^{(1)} + b^{(2)} X^{(2)}} = \mathrm{E}\,\frac{X^{(2)}}{b^{(1)} X^{(1)} + b^{(2)} X^{(2)}} = 1, \tag{2.2.4} \]

the latter for b^{(1)}, b^{(2)} > 0, b^{(1)} + b^{(2)} = 1.


We rephrase the Kuhn-Tucker conditions in a form that will be more convenient to use. To this end we define \(Z := Y^{(2)} - Y^{(1)} \sim N(\mu_2 - \mu_1;\ \sigma_1^2 - 2\sigma_{12} + \sigma_2^2)\) and connect b^{(1)}, b^{(2)} ≠ 0 to r ∈ (0, ∞) via r := b^{(2)}/b^{(1)}, b^{(1)} = 1/(1 + r) and b^{(2)} = r/(1 + r). Then we can rewrite the right-hand sides of (2.2.2) and (2.2.3) as

\[ \mathrm{E}\exp Z \le 1, \tag{2.2.2'} \]
\[ \mathrm{E}\exp(-Z) \le 1. \tag{2.2.3'} \]

By simple calculations, the right-hand side of (2.2.4) is equivalent to the existence of r ∈ (0, ∞) such that

\[ \mathrm{E}\,\frac{\exp Z - 1}{1 + r \exp Z} = 0. \tag{2.2.4'} \]
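The "simple calculations" behind (2.2.4') can be spelled out. Dividing numerator and denominator by X^{(1)} and substituting b^{(1)} = 1/(1 + r), b^{(2)} = r/(1 + r) gives

```latex
\mathrm{E}\,\frac{X^{(2)}}{b^{(1)}X^{(1)} + b^{(2)}X^{(2)}}
 - \mathrm{E}\,\frac{X^{(1)}}{b^{(1)}X^{(1)} + b^{(2)}X^{(2)}}
 = \mathrm{E}\,\frac{\exp Z - 1}{b^{(1)} + b^{(2)}\exp Z}
 = (1+r)\,\mathrm{E}\,\frac{\exp Z - 1}{1 + r\exp Z}.
```

Since b^{(1)} times the first expectation plus b^{(2)} times the second always equals 1, the two expectations in (2.2.4) both equal 1 exactly when their difference vanishes, which is (2.2.4').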
From (2.2.2') to (2.2.4') one can observe the following:

1. The log-optimal portfolio (b^{(1)*}, b^{(2)*}) only depends upon

\[ \mu := \mu_2 - \mu_1 \quad\text{and}\quad \sigma^2 := \sigma_1^2 - 2\sigma_{12} + \sigma_2^2, \]

i.e. \((b^{(1)*}, b^{(2)*}) = (b^{(1)*}(\mu, \sigma^2),\ b^{(2)*}(\mu, \sigma^2))\).

2. Evaluating \(\mathrm{E}\exp Z = \exp(\mu + \sigma^2/2)\) and \(\mathrm{E}\exp(-Z) = \exp(-\mu + \sigma^2/2)\) yields:

\[ (1, 0) \text{ log-optimal} \iff \mu \le -\frac{\sigma^2}{2}, \tag{2.2.5} \]
\[ (0, 1) \text{ log-optimal} \iff \mu \ge \frac{\sigma^2}{2}. \tag{2.2.6} \]

3. For µ = 0, (1/2, 1/2) is the log-optimal portfolio, since by symmetry

\[ \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^\infty \frac{\exp z - 1}{\exp z + 1} \exp\Bigl( -\frac{z^2}{2\sigma^2} \Bigr)\, dz = 0. \]

The value of the log-optimal portfolio of stock 1 and stock 2 is

\[ V^* := \mathrm{E}\log\bigl( b^{(1)*} X^{(1)} + b^{(2)*} X^{(2)} \bigr). \]

However, it will be convenient to work with the portfolio improvement

\[ V_\sigma(\mu) := V^* - \mu_1 = \mathrm{E}\log X^{(1)} + \mathrm{E}\log\bigl( b^{(1)*} + (1 - b^{(1)*}) \exp Z \bigr) - \mu_1 = \mathrm{E}\log\bigl( b^{(1)*}(\mu, \sigma^2) + (1 - b^{(1)*}(\mu, \sigma^2)) \exp Z \bigr) \tag{2.2.7} \]

achieved when including stock 2 in a portfolio of stock 1. V_σ(µ) only depends upon the distribution of

\[ Z = Z_{\mu,\sigma^2} \sim N(\mu;\ \sigma^2), \]

i.e. again on µ and σ² only. The following lemma summarizes basic properties of V_σ(µ), similar to the results derived in Ohlson (1972):

Lemma 2.2.2.
1. For any fixed σ ∈ [0, ∞), V_σ(µ) is a continuous function of µ ∈ IR, strictly increasing on [−σ²/2, ∞).
2. V_σ(µ) = 0 for all µ ∈ (−∞, −σ²/2] and V_σ(µ) = µ for all µ ∈ [σ²/2, ∞).
3. V_σ(0) is a nonnegative, strictly increasing continuous function of σ ∈ [0, ∞).

Proof. 1. On the one hand the log-optimal portfolio \((b^{(1)*}(\mu, \sigma^2), b^{(2)*}(\mu, \sigma^2))\) is unique (Theorem 1.3.1); on the other hand, a continuous solution, say \((b^{(1)}(\mu, \sigma^2), b^{(2)}(\mu, \sigma^2))\), to the maximization problem

\[ \mathrm{E}\log\bigl( b^{(1)} X^{(1)} + b^{(2)} X^{(2)} \bigr) = \mu_1 + \mathrm{E}\log\bigl( b^{(1)} + b^{(2)} \exp Z_{\mu,\sigma^2} \bigr) \longrightarrow \max_{b^{(1)},\, b^{(2)} \ge 0,\; b^{(1)} + b^{(2)} = 1} \]

can be found (Aliprantis and Border, 1999, Theorem 16.31 together with Lemma 16.6). Hence both coincide and b^{(1)*}(µ, σ²) is a continuous function of µ. From the equation \(V_\sigma(\mu) = \mathrm{E}\log( b^{(1)*} + (1 - b^{(1)*}) \exp Z_{\mu,\sigma^2} )\) the continuity assertion follows.

Now let −σ²/2 < µ < ν. Then

\[ V_\sigma(\mu) = \mathrm{E}\log\bigl( b^{(1)*}(\mu, \sigma^2) + (1 - b^{(1)*}(\mu, \sigma^2)) \exp Z_{\mu,\sigma^2} \bigr) < \mathrm{E}\log\bigl( b^{(1)*}(\mu, \sigma^2) + (1 - b^{(1)*}(\mu, \sigma^2)) \exp Z_{\nu,\sigma^2} \bigr) \le \mathrm{E}\log\bigl( b^{(1)*}(\nu, \sigma^2) + (1 - b^{(1)*}(\nu, \sigma^2)) \exp Z_{\nu,\sigma^2} \bigr) = V_\sigma(\nu). \]

The first inequality follows from −σ²/2 < µ, i.e. b^{(1)*}(µ, σ²) < 1; the second inequality holds by definition of b^{(1)*} as a component of the log-optimal portfolio.
2. is a direct consequence of (2.2.5) and (2.2.6): we just check \(V_\sigma(\mu) = \mathrm{E}\log X^{(1)} - \mu_1 = \mu_1 - \mu_1 = 0\) for \((b^{(1)*}, b^{(2)*}) = (1, 0)\) and calculate \(V_\sigma(\mu) = \mathrm{E}\log X^{(2)} - \mu_1 = \mu_2 - \mu_1 = \mu\) for \((b^{(1)*}, b^{(2)*}) = (0, 1)\).




3. As noted above, µ = 0 implies \((b^{(1)*}, b^{(2)*}) = (1/2, 1/2)\). Hence we find that

\[ V_\sigma(0) = \log\frac{1}{2} + \mathrm{E}\log\bigl( 1 + \exp Z_{0,\sigma^2} \bigr) = \log\frac{1}{2} + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty \log\bigl( 1 + \exp(\sigma w) \bigr) \exp\Bigl( -\frac{w^2}{2} \Bigr)\, dw. \]

From this representation we see that V_σ(0) is continuous for σ ∈ (0, ∞). Moreover, using the monotone convergence theorem (Williams, 1991, 5.3) we calculate

\[ V_0(0) := \lim_{\sigma \to 0^+} V_\sigma(0) = \log\frac{1}{2} + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty \log 2\, \exp\Bigl( -\frac{w^2}{2} \Bigr)\, dw = \log\frac{1}{2} + \log 2 = 0. \]

Concerning monotonicity, we remark that in what follows the interchange of differentiation and integration is possible by the standard theorem for integrals depending on a parameter (see e.g. Williams, 1991, A.16.1). Thus for σ > 0

\[ \frac{\partial}{\partial\sigma} V_\sigma(0) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty \frac{w}{1 + \exp(-\sigma w)} \exp\Bigl( -\frac{w^2}{2} \Bigr)\, dw. \]

Finally, since w/(1 + exp(−σw)) > w/2 for all w ≠ 0,

\[ \frac{\partial}{\partial\sigma} V_\sigma(0) > \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty \frac{w}{2} \exp\Bigl( -\frac{w^2}{2} \Bigr)\, dw = 0, \]

and V_σ(0) is strictly increasing for σ ≥ 0. □


We are now in the position to finish the

Proof of Theorem 2.2.1. Sufficiency: Suppose we are given \(\sigma_L^2 := \sigma_3^2 - 2\sigma_{23} + \sigma_2^2 < \sigma_U^2 := \sigma_1^2 - 2\sigma_{12} + \sigma_2^2\). Then, from part 3 (combined with parts 1, 2) of Lemma 2.2.2, we can choose a W > 0 with

\[ V_{\sigma_L}(0) < W < V_{\sigma_U}(0). \tag{2.2.8} \]

By parts 1 and 2 of the lemma and the intermediate value theorem, \(\mu_1 := V_{\sigma_U}^{-1}(W)\) and \(m := V_{\sigma_L}^{-1}(W)\) are well-defined and – observing the strict monotonicity property of \(V_{\sigma_L}\) and \(V_{\sigma_U}\) as in part 1 of the lemma – we obtain µ₁ < 0 < m from (2.2.8).

This choice is illustrated in Figure 2.1, where the reduced portfolio values V_σ(µ) were calculated by means of the Cover algorithm (Theorem 1.3.2) and numerical integration (composite trapezoidal rule, Isaacson and Keller, 1994, Sec. 7.5), both with an accuracy of 10⁻⁷.

Choose some µ₃ with 0 < µ₃ < m. Then, by part 1 again,

\[ V_{\sigma_L}(\mu_3) < V_{\sigma_L}(m) = W = V_{\sigma_U}(\mu_1). \]


[Figure 2.1 here: the curves V_{σ_U}(µ) and V_{σ_L}(µ) plotted against µ, with σ_U = 1.00000, σ_L = 0.50000, W = 0.08940, µ₁ = −0.05000 and m = 0.08666 marked.]

Figure 2.1: An example of the situation as in the proof of Theorem 2.2.1.

Combining this with the definition of the reduced portfolio value \(V_\sigma(\mu_i - \mu_j) = V^*_{\{i,j\}} - \mu_j\) in (2.2.7) for a portfolio of stocks {i = 1, j = 2} (set σ² = σ₁² − 2σ₁₂ + σ₂²) and of stocks {j = 2, i = 3} (set σ² = σ₃² − 2σ₂₃ + σ₂²), we obtain \(V^*_{\{2,3\}} = V_{\sigma_L}(\mu_3) + 0 < V_{\sigma_U}(\mu_1) + 0 = V^*_{\{1,2\}}\).

Necessity: If instead of (2.2.1) we assume σ₃² − 2σ₂₃ ≥ σ₁² − 2σ₁₂, then µ₁ < µ₂ := 0 < µ₃ implies \(V^*_{\{1,2\}} = V_{\sigma_U}(\mu_1) + 0 \le V_{\sigma_U}(0) \le V_{\sigma_L}(0) < V_{\sigma_L}(\mu_3) + 0 = V^*_{\{2,3\}}\) (Lemma 2.2.2, parts 1 and 2 for the first and third inequality, part 3 for the second). □
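The construction in the proof can be reproduced numerically. The sketch below is an illustration (parameter values mirror Figure 2.1; the quadrature-plus-grid solver is an assumption, not the thesis' Cover-algorithm computation): it evaluates V_σ(µ) = max_{0≤b≤1} E log(b + (1−b)e^Z) and exhibits a market in which the two "worst" stocks form the better pair.

```python
import numpy as np

def V(mu, sigma, n_nodes=120, n_b=2001):
    """Reduced portfolio value V_sigma(mu) = max_b E log(b + (1-b) exp(Z)),
    Z ~ N(mu, sigma^2), via Gauss-Hermite quadrature plus a grid over b."""
    x, w = np.polynomial.hermite_e.hermegauss(n_nodes)  # weight exp(-x^2/2)
    z = mu + sigma * x
    w = w / np.sqrt(2 * np.pi)                          # normalise to an N(0,1) expectation
    b = np.linspace(0.0, 1.0, n_b)[:, None]
    vals = (np.log(b + (1 - b) * np.exp(z)[None, :]) * w[None, :]).sum(axis=1)
    return float(vals.max())

# the situation of Theorem 2.2.1 with sigma_U = 1.0, sigma_L = 0.5 (cf. Figure 2.1)
v_12 = V(-0.05, 1.0)   # V_{sigma_U}(mu_1): reduced value of the pair {1,2}
v_23 = V(0.02, 0.5)    # V_{sigma_L}(mu_3): reduced value of the pair {2,3}
```

Here v_12 > v_23 although µ₁ < 0 < µ₃: the larger relative volatility σ_U outweighs the poorer drift, exactly as the theorem predicts.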

2.3 Examples
We conclude this chapter with examples of a real market where a situation as in
Theorem 2.2.1 is set up by empirical market data. We assume that the distribu-
tion of the returns is log-normal with the parameters provided by the standard

estimates of mean and variance (thus admittedly oversimplifying things). Figure 2.2 a-c) shows some diagnostic checks run on American Express Co. (AXP)
daily log-return data from the closing prices at the New York Stock Exchange
2/1/1998 - 30/11/2000 (data from www.wallstreetcity.com).
Comparing the histogram of the data in Figure 2.2 a) and the normal density with estimated mean and variance, it should be possible to approximately explain the data by assuming a log-normal return distribution. This is supported by the normal probability plots of the log-return data in 2.2 b): lines a, b and c correspond to the normal probability plots from 245 consecutive data each, plot b being moved to the right by 0.05, c by 0.1. 95% confidence bands are shown. The lines a, b and c being roughly parallel, there is no alarming sign that the return distribution loses stationarity during the period under investigation. The sample autocovariance function in 2.2 c) (95% confidence bands around zero) suggests approximately uncorrelated (in the Gaussian case independent) day-to-day data.
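The diagnostic of Figure 2.2 c) is straightforward to reproduce. The sketch below is illustrative (the ±1.96/√n white-noise band is the standard approximation, applied here on the covariance scale; the white-noise test data are made up):

```python
import numpy as np

def sample_autocov(y, max_lag=50):
    """Sample autocovariances gamma_hat(k), k = 0..max_lag, together with the
    approximate 95% white-noise band (+/- 1.96/sqrt(n), scaled by gamma_hat(0))."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    yc = y - y.mean()
    acov = np.array([(yc[: n - k] * yc[k:]).sum() / n for k in range(max_lag + 1)])
    band = 1.96 / np.sqrt(n) * acov[0]
    return acov, band

# white-noise sanity check: almost all lags should stay inside the band
rng = np.random.default_rng(1)
acov, band = sample_autocov(rng.normal(size=10000))
```

Applied to the AXP log-returns, the same computation yields Figure 2.2 c); lags escaping the band would indicate serial correlation.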
Example 2.1: The following 3 stocks are chosen from the range of the Dow Jones Industrial Average.

  i   stock                             estim.: µi − µ2        σi² − 2σi2 + σ2²
  1   American Express Co. (AXP)        −0.4503008 · 10⁻⁴      5.1832410 · 10⁻⁴
  2   Citigroup Inc. (C)                 0.0000000             0.0000000
  3   United Technologies Corp. (UTX)    0.1215608 · 10⁻⁴      8.4458413 · 10⁻⁴

The last two columns report estimates based on the empirical mean and variance of the difference Y^{(i)} − Y^{(2)} of the log-returns of stock i and stock 2.
Suppose we want to enhance a portfolio of C by either AXP or UTX. Since σ₁² − 2σ₁₂ + σ₂² < σ₃² − 2σ₂₃ + σ₂², we conclude that there is no indication that we should prefer AXP to UTX.
Example 2.2: Next, consider the following stocks from the Dow Jones Transportation Average:

  i   stock                            estim.: µi − µ2        σi² − 2σi2 + σ2²
  1   J.B. Hunt Transp. Serv. (JBHT)   −0.4207553 · 10⁻⁴      17.5748915 · 10⁻⁴
  2   Yellow Corp. (YELL)               0.0000000              0.0000000
  3   Union Pacific Corp. (UNP)         0.3721663 · 10⁻⁴      10.7871249 · 10⁻⁴

[Figure 2.2 a) here: histogram of the log-return data with the estimated normal density.]

[Figure 2.2 b) here: normal probability plots (lines a, b, c) with 95% confidence bands.]

[Figure 2.2 c) here: sample autocovariance function up to lag 50, with 95% confidence bands around zero.]

Figure 2.2: Diagnostic plots for American Express Co. (AXP) log-returns from closing prices NYSE, 2/1/1998-30/11/2000.

If we want to include either JBHT or UNP in a portfolio of YELL shares, we observe that σ₁² − 2σ₁₂ + σ₂² > σ₃² − 2σ₂₃ + σ₂². Hence it may be advisable to choose JBHT as an additional stock in spite of its apparently poorer performance.
Indeed, calculating the log-optimal portfolio for the alternatives we obtain:

  additional stock j   weight b of YELL   V*_{2,j} − V*_{2}    residual value
  1 (JBHT)             0.523951           1.9910 · 10⁻⁴        −1.6092 · 10⁻¹⁰
  3 (UNP)              0.465490           1.5407 · 10⁻⁴         1.3154 · 10⁻¹⁰

V*_{2,j} − V*_{2} is the improvement of the portfolio value achieved under inclusion of stock j. As can be seen, our suspicion was justified: choosing JBHT yields a (slightly) greater portfolio improvement than UNP. The residual value in the fourth column is 1 − E[X^{(2)}/(bX^{(2)} + (1 − b)X^{(j)})] and indicates that the Kuhn-Tucker condition (2.2.4) for log-optimality of the stated portfolio weight b is satisfied. The values in the third and fourth column were computed with an error of at most 10⁻⁹ using the composite trapezoidal rule.
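The table can be re-derived independently of the trapezoidal rule. The sketch below is an illustrative re-computation (Gauss-Hermite quadrature plus a grid search over b, with the parameter estimates copied from Example 2.2); it recovers the tabulated weights and improvements up to numerical tolerance.

```python
import numpy as np

def log_opt_pair(mu, sig2, n_nodes=120, n_b=20001):
    """Weight b of the reference stock (YELL) and the improvement V_sigma(mu)
    for a two-stock log-normal market, Z ~ N(mu, sig2)."""
    x, w = np.polynomial.hermite_e.hermegauss(n_nodes)   # weight exp(-x^2/2)
    z = mu + np.sqrt(sig2) * x
    w = w / np.sqrt(2 * np.pi)
    b = np.linspace(0.0, 1.0, n_b)
    vals = (np.log(b[:, None] + (1 - b[:, None]) * np.exp(z)[None, :]) * w[None, :]).sum(axis=1)
    k = int(vals.argmax())
    return float(b[k]), float(vals[k])

b_jbht, v_jbht = log_opt_pair(-0.4207553e-4, 17.5748915e-4)  # YELL + JBHT
b_unp, v_unp = log_opt_pair(0.3721663e-4, 10.7871249e-4)     # YELL + UNP
```

JBHT again beats UNP as the partner for YELL, despite its poorer single performance, confirming the volatility-pumping effect numerically.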

CHAPTER 3

Predicted stock returns and portfolio selection
In the last chapter we have seen that portfolio selection on the basis of single-stock performances alone is problematic. Information about the variance-covariance structure of the stock returns in the market is indispensable.
there are several ways to incorporate such information. For example, in a
log-normal market, an investor might simply use an estimate of the variance-
covariance matrix of the return distribution (conditioned on the past) and run
Cover’s algorithm (Theorem 1.3.2). Another investor might first try to use
market observations to get an idea how the stock returns in the market are
correlated and what temporal correlation prevails. Given this knowledge, he
then produces forecasts of the stock returns for the next market period and re-
arranges his portfolio according to these forecasts – typically in a “greedy” way,
i.e. trying to pick out the return maximal stock for the next market period.
Clearly, the conditional log-optimal portfolio depends on much more than mere
forecasts of the returns in the next market period. Therefore, an investor follow-
ing the “greedy” strategy is bound to lose out in comparison with the investor
using conditional log-optimal portfolios. But still, the greedy strategy is pop-
ular among investors (the forecasting bit typically being much of a heuristics)
and should therefore be analysed thoroughly. This is the task of this chapter.
In Section 3.1 we formalize the greedy strategy, which gives rise to two questions:
How suboptimal can the strategy be in comparison with log-optimal portfolio
selection, and, what is a reasonable way to manage the forecasting part? As to
the first question, we will see that suboptimality can be bounded in terms of
the variance of the logarithmic returns (cf. (3.1.4)). In view of this, the greedy
strategy is appealing in markets with sufficiently small stochastic fluctuation –
“sufficiently small” depending on how much the investor is prepared to lose out

in investment performance.
We will embark on the second question, the forecasting problem, in Section 3.2.
Once we have seen that among the many ways of forecasting, so-called “strong”
forecasting is the method of choice for the greedy strategy, we will leave the stock
market behind and handle the problem in the general framework of Gaussian
time series forecasting. Clearly, the prediction of stationary Gaussian time series
is of interest very much in its own right, with applications arising in many fields.
Based on an approximation argument (Section 3.2.1), a forecasting algorithm
will be presented (Section 3.2.2) that – under weak regularity conditions –
is strongly consistent for huge classes of Gaussian processes (Theorem 3.2.2).
Explicit examples are given, highlighting how general the algorithm is (examples
after Corollary 3.2.3). The results are proved in Section 3.3. Simulations and
further examples in Section 3.4 conclude the chapter.

3.1 A strategy using predicted log-returns


To avoid unnecessary technicalities, we choose the simplest setting of a mar-
ket with one stock (with returns ..., X−1, X0 , X1 , X2, ...) and one risk-free asset
(bond with return r). It is not hard, though, to develop analogous techniques for more general markets.
The log-transform has been found to have a stabilizing effect on the return data
Xi in so far as Yi := log Xi follows a symmetric distribution (around the mean)
in many real markets. For this reason, we will use Yi rather than Xi in the
following. Under full knowledge of the process past Yn, Yn−1 , ..., the investor is
in principle advised to invest the log-optimal proportion
\[ b^* := \arg\max_{b \in [0,1]} \mathrm{E}[\log(b \exp(Y_{n+1}) + (1 - b) r) \mid Y_n, Y_{n-1}, \dots] \tag{3.1.1} \]

of his wealth in the stock. However, the non-institutional (private) investor


typically takes a different stance in two respects:

– His main interest is simply to determine whether or not he should invest a


given amount of wealth in the specific stock (rather than to determine
what proportion to invest where).

– He takes the investment decision on the basis of the predicted return of the
next market period only.

In particular, he may try to achieve the maximum possible return max{Xn+1 , r}


in the next market period with the following “greedy strategy”.

1. At each step n he produces an estimate Ŷn+1 for the next outcome Yn+1 on
the basis of the observed Yn, ..., Y1 (note that it is not possible to observe
the process to the infinite past).

2. He invests according to

\[ b^*_{\mathrm{approx}} := \begin{cases} 1 & \text{if } \exp(\hat Y_{n+1}) \ge r, \\ 0 & \text{otherwise.} \end{cases} \tag{3.1.2} \]

If the portfolio is not rearranged on a daily basis, but say, on a two month basis,
this is a typical example of a buy-and-hold strategy: Once a fixed amount of
money is invested according to the investor's belief about what the market will look like in two months' time, the portfolio remains unchanged.
buy-and-hold strategy (where only stocks from the CBS index are picked whose
predicted two month return exceeds a certain threshold) has been investigated
in a case study described in Franke et al. (2001, Sec. 16.4 and the references
there).
Comparing b∗approx = arg maxb∈[0,1] log(b exp(Ŷn+1) + (1 − b)r) and (3.1.1), we
see that the greedy strategy is very much in the spirit of approximating the
log-optimality principle. However, b∗ will be a function of Yn, Yn−1 , ... rather
than of a single statistic Ŷn+1, and the log-optimal portfolio will be diversified
(i.e., b∗ ∈ [0, 1]), not just 0 or 1. As a consequence the investor loses out
on investment performance in comparison with the log-optimal portfolio. Two
questions arise:

– How should he construct the statistic Ŷn+1 in order not to lose out “too
much” in comparison with log-optimal portfolio performance?

– What performance loss does the particular Ŷn+1 inflict on the investor in
the worst possible case?

These are the two problems analysed in this section.


If we want to approximate the log-optimal b∗ on the basis of nothing else but
Ŷn+1 , we need to find an approximation g of the target function,

E[log(b exp(Yn+1 ) + (1 − b)r)|Yn, Yn−1, ...] ≈ g(b, Ŷn+1, r).



Not every g and not every Ŷ_{n+1} will be appropriate for this kind of approximation. A Taylor expansion may give us some guideline. Put

\[ f_b(y) := \log(b \exp(y) + (1 - b) r) \]

and note that

\[ 0 \le f_b'(y) = \frac{1}{1 + (1/b - 1) r \exp(-y)} \le 1, \qquad 0 \le f_b''(y) = \frac{(1/b - 1) r \exp(-y)}{\bigl(1 + (1/b - 1) r \exp(-y)\bigr)^2} \le \frac{1}{4}. \]

Now consider the expansion

\[ f_b(Y_{n+1}) = f_b(\hat Y_{n+1}) + f_b'(\hat Y_{n+1})(Y_{n+1} - \hat Y_{n+1}) + \frac{1}{2} f_b''(\xi_b)(Y_{n+1} - \hat Y_{n+1})^2 \tag{3.1.3} \]
with some random ξ_b from the convex hull of Y_{n+1} and Ŷ_{n+1}. From (3.1.3) and the σ(Y_n, Y_{n−1}, ...)-measurability of Ŷ_{n+1} we obtain

\[ \mathrm{E}[f_b(Y_{n+1}) \mid Y_n, Y_{n-1}, \dots] = f_b(\hat Y_{n+1}) + f_b'(\hat Y_{n+1})\bigl(\mathrm{E}[Y_{n+1} \mid Y_n, Y_{n-1}, \dots] - \hat Y_{n+1}\bigr) + \frac{1}{2}\mathrm{E}[f_b''(\xi_b)(Y_{n+1} - \hat Y_{n+1})^2 \mid Y_n, Y_{n-1}, \dots]. \]

As can be seen, the choice

\[ \hat Y_{n+1} := \mathrm{E}[Y_{n+1} \mid Y_n, Y_{n-1}, \dots] \]

not only makes the first-order term vanish but also minimizes the upper bound

\[ \frac{1}{2}\mathrm{E}[f_b''(\xi_b)(Y_{n+1} - \hat Y_{n+1})^2 \mid Y_n, Y_{n-1}, \dots] \le \frac{1}{8}\mathrm{E}[(Y_{n+1} - \hat Y_{n+1})^2 \mid Y_n, Y_{n-1}, \dots] \]

on the second-order term.
Using b*_{approx} (based on Ŷ_{n+1}), the investor loses at most

\[ \mathrm{E}[f_{b^*}(Y_{n+1}) - f_{b^*_{\mathrm{approx}}}(Y_{n+1}) \mid Y_n, Y_{n-1}, \dots] \]
\[ = \mathrm{E}[f_{b^*}(Y_{n+1}) - f_{b^*_{\mathrm{approx}}}(\hat Y_{n+1}) \mid Y_n, Y_{n-1}, \dots] + \mathrm{E}[f_{b^*_{\mathrm{approx}}}(\hat Y_{n+1}) - f_{b^*_{\mathrm{approx}}}(Y_{n+1}) \mid Y_n, Y_{n-1}, \dots] \]
\[ \le \mathrm{E}[f_{b^*}(Y_{n+1}) - f_{b^*}(\hat Y_{n+1}) \mid Y_n, Y_{n-1}, \dots] + \mathrm{E}[f_{b^*_{\mathrm{approx}}}(\hat Y_{n+1}) - f_{b^*_{\mathrm{approx}}}(Y_{n+1}) \mid Y_n, Y_{n-1}, \dots] \]
\[ = \frac{1}{2}\mathrm{E}\bigl[\bigl(f_{b^*}''(\xi_{b^*}) - f_{b^*_{\mathrm{approx}}}''(\xi_{b^*_{\mathrm{approx}}})\bigr)(Y_{n+1} - \hat Y_{n+1})^2 \mid Y_n, Y_{n-1}, \dots\bigr] \]
\[ \le \frac{1}{8}\mathrm{E}[(Y_{n+1} - \hat Y_{n+1})^2 \mid Y_n, Y_{n-1}, \dots] = \frac{1}{8}\mathrm{Var}[Y_{n+1} \mid Y_n, Y_{n-1}, \dots]. \]

Hence

\[ \mathrm{E} f_{b^*}(Y_{n+1}) - \mathrm{E} f_{b^*_{\mathrm{approx}}}(Y_{n+1}) \le \frac{1}{8}\mathrm{E}\,\mathrm{Var}[Y_{n+1} \mid Y_n, Y_{n-1}, \dots] = \frac{1}{8}\bigl(\mathrm{Var}\, Y_{n+1} - \mathrm{Var}\, \hat Y_{n+1}\bigr) \le \frac{1}{8}\mathrm{Var}\, Y_{n+1}, \tag{3.1.4} \]

and on the average the investor won't lose more than (1/8) Var Y_{n+1}. If he is prepared to sacrifice this amount, then the greedy strategy (3.1.2) is a viable option. Still, this does not obviate the necessity to estimate Ŷ_{n+1} = E[Y_{n+1}|Y_n, Y_{n−1}, ...], but as we will see in the next section, for many practically relevant markets this can be done with reasonable effort.
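The bound (3.1.4) can be verified numerically in the simplest i.i.d. Gaussian case, where Ŷ_{n+1} = E Y. The sketch below (with made-up parameter values; the quadrature-plus-grid evaluation is an assumption of the illustration) compares the greedy rule (3.1.2) with the log-optimal stock proportion.

```python
import numpy as np

def check_greedy_bound(mu=0.01, sig=0.2, r=1.005, n_nodes=120):
    """Loss of the greedy rule (3.1.2) versus the log-optimal b* for
    i.i.d. Y ~ N(mu, sig^2), together with the bound Var(Y)/8 from (3.1.4)."""
    x, w = np.polynomial.hermite_e.hermegauss(n_nodes)  # weight exp(-x^2/2)
    y = mu + sig * x
    w = w / np.sqrt(2 * np.pi)
    f = lambda b: float((w * np.log(b * np.exp(y) + (1 - b) * r)).sum())
    v_opt = max(f(b) for b in np.linspace(0.0, 1.0, 2001))  # log-optimal value
    b_greedy = 1.0 if np.exp(mu) >= r else 0.0              # rule (3.1.2) with Y_hat = E[Y]
    return v_opt - f(b_greedy), sig ** 2 / 8

loss, bound = check_greedy_bound()
```

Here the greedy investor puts everything into the stock (since exp(0.01) ≥ 1.005) and loses a little versus the diversified log-optimal portfolio, but never more than Var(Y)/8.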

3.2 Prediction of Gaussian log-returns


Prediction problems as in the previous section are one of the core topics of
statistical analysis of time series. Using the distinctions of prediction problems
in Györfi, Morvai and Yakowitz (1998), the problem of estimating E[Yn+1|Fn ]
for some sub-σ-algebra Fn of σ(Yn, Yn−1, ...) is a problem of dynamic forecasting,
i.e., in each market period, the target to be estimated changes (“moving target”).
Typical choices for Fn are σ(Yn, Yn−1, ...), σ(Yn, ..., Y0) or σ(Yn, ..., Yn−dn+1 ) for
some sequence dn ∈ IN with dn → ∞, depending on what length of the process
past should be included. It should be noted that although we consider a bi-infinite sequence of random variables \(\{Y_i\}_{i=-\infty}^{\infty}\) (square-integrable and defined on a common probability space (Ω, A, P)), we only observe realisations from "time" i = 1 onwards.
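As a simple baseline for this forecasting problem, one can fit the classical finite-memory least-squares predictor of E[Y_{n+1} | Y_n, ..., Y_{n−d+1}]. The sketch below is only an illustration (it is not the algorithm developed in Section 3.2.2, and the AR(1) example data are made up):

```python
import numpy as np

def ls_forecast(y, d):
    """One-step forecast of a zero-mean Gaussian series from the last d values,
    with coefficients fitted by least squares on all past length-d windows."""
    n = len(y)
    Z = np.array([y[t : t + d] for t in range(n - d)])  # rows (Y_t, ..., Y_{t+d-1})
    coef, *_ = np.linalg.lstsq(Z, y[d:], rcond=None)    # regress Y_{t+d} on the window
    return float(y[-d:] @ coef), coef

# AR(1) example: Y_t = 0.8 Y_{t-1} + eps_t, so E[Y_{n+1} | past] = 0.8 Y_n
rng = np.random.default_rng(0)
y = np.zeros(5000)
for t in range(1, 5000):
    y[t] = 0.8 * y[t - 1] + rng.normal()
pred, coef = ls_forecast(y, d=3)
```

For a Gaussian AR(1) the fitted coefficients recover approximately (0, 0, 0.8), so the forecast approximates the conditional expectation 0.8·Y_n.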
The majority of existing algorithms for nonparametric dynamic forecasting (for
an introduction see Bosq (1996) or Györfi, Härdle et al. (1998)) rely on mix-
ing conditions which can rarely be verified from observational data. Inter-
est has turned to dynamic forecasting under weak conditions such as mere
stationarity and ergodicity, but avoiding mixing conditions. In this context,
a forecaster Ê(Y_n, ..., Y_0) for the conditional expectation E[Y_{n+1}|F_n] is called strongly (weakly) universally consistent if

\[ \lim_{n \to \infty} \bigl| \hat E(Y_n, \dots, Y_0) - \mathrm{E}[Y_{n+1} \mid \mathcal{F}_n] \bigr| = 0 \]

with probability 1 (in the L¹-sense) for any stationary and ergodic process {Y_i}_i. By stationarity, the computation of weakly consistent estimators (F_n := σ(Y_n, Y_{n−1}, ...) for the moment) can be reduced to the so-called static forecasting problem (Györfi, Morvai and Yakowitz, 1998): to find Ê such that

\[ \lim_{n \to \infty} \mathrm{E}\bigl| \hat E(Y_0, \dots, Y_{-n}) - \mathrm{E}[Y_1 \mid Y_0, Y_{-1}, \dots] \bigr| = 0. \]

Based on conditional distribution estimates P̂(dy | Y_0, ..., Y_{−n}), Morvai, Yakowitz
and Algoet (1997) obtained weakly consistent estimators for the class of bounded
stationary ergodic processes, and Algoet (1999) for the class of finite-mean
stationary ergodic processes. Here again, to obtain convergence with probability 1
instead of L1-convergence, mixing conditions were needed. Concerning strong
universal consistency, we encounter various limitations, one of the most striking
derived by Bailey (1976), with Ryabko (1988) sketching an easier intuitive
proof later formalised by Györfi, Morvai and Yakowitz (1998). Their result
states that for any estimator Ê(Y_n, ..., Y_0) of the conditional expectation
E[Y_{n+1} | Y_n, ..., Y_0], there is a stationary ergodic, binary-valued process {Y_i}_i such
that

    P( limsup_{n→∞} | Ê(Y_n, ..., Y_0) − E[Y_{n+1} | Y_n, ..., Y_0] | ≥ 1/4 ) ≥ 1/8.
Algoet (1999) used refined techniques to show that there also exists a stationary
ergodic, binary-valued sequence {Y_i}_i with

    E[ limsup_{n→∞} | Ê(Y_n, ..., Y_0) − E[Y_{n+1} | Y_n, ..., Y_0] | ] ≥ 1/2.
This rules out the existence of strongly universally consistent forecasters for the
moving target E[Yn+1|Yn, ..., Y0].
This result is discouraging, but it does not rule out the existence of strongly
consistent forecasting rules for log-return processes as they arise in real financial
markets. In particular, Gaussian log-return processes have been proven to be a
good approximation for real log-return processes. Györfi, Morvai and Yakowitz
(1998) note that there has not yet been found an answer to the question whether
strongly consistent forecasters for E[Yn+1|Yn, ..., Y0] or even E[Yn+1|Yn, Yn−1 , ...]
exist in case the process {Yi}i is known to be Gaussian. The results of this
section show that for notably wide classes of Gaussian processes the answer is
affirmative. It should be noted, however, that strong universal consistency is
by far not the only notion of “strong” forecasting. Other methods, the so-called
“universal” predictors, are based on Cesàro convergence, i.e., the average of the
errors (e.g., squared prediction errors, or the squared differences of the estimates
and E[Y_{n+1} | Y_n, Y_{n−1}, ...]) converges with probability one to the minimal possible
value for any (bounded) stationary and ergodic process (Algoet, 1994). Such
estimators were obtained by Algoet (1992) and Morvai et al. (1996). Based on
Györfi, Lugosi and Morvai (1999), universal predictors for bounded or Gaussian
stationary ergodic processes have been constructed by Györfi and Lugosi (2001).
Throughout, we make the following two assumptions:

1. Let {Y_n}_{n=−∞}^{∞} be a real-valued, purely nondeterministic (i.e., there is no
   deterministic term in the representation (3.2.2) below), stationary and
   ergodic Gaussian process with

       EY_n = 0,   Var Y_n = σ² > 0   (3.2.1)

   and autocovariance function γ(k) := E(Y_{n+k} Y_n). The assumption EY_n = 0
   is no restriction: it is achieved by differencing the original process of log-returns.
   We denote the differenced process by {Y_i} again.

2. From 1. and Wold’s decomposition theorem (Hida and Hitsuda, 1993,
   Theorems 3.2 and 3.3; Shiryayev, 1984, VI §5 Theorem 2), a canonical
   L²-representation

       Y_n = Σ_{j=0}^{∞} ψ_j ε_{n−j}   (3.2.2)

   can be found with ψ_0 = 1, Σ_{j=0}^{∞} |ψ_j|² < ∞ and independent, identically
   N(0, σ_ε²)-distributed innovations ε_n. For z ∈ ℂ, |z| < 1, the series
   Σ_{j=0}^{∞} ψ_j z^j converges to the transfer function ψ(z) := Σ_{j=0}^{∞} ψ_j z^j, which
   never vanishes for |z| < 1 (Hida and Hitsuda, 1993, III §3 i).

   We assume that

       Σ_{j=0}^{∞} |ψ_j| < ∞,   (3.2.3)

   ensuring that the equality (3.2.2) holds with probability 1 (Brockwell and
   Davis, 1991, Prop. 3.1.1).

Statistical and theoretical aspects of second order stationary processes are
treated extensively in the literature, among many others in Brockwell and Davis
(1991), Caines (1988) and Hannan and Deistler (1988). Many results on Gaus-
sian processes can be found in Neveu (1968), Ibragimov and Rozanov (1978)
and Hida and Hitsuda (1993). Indeed, (3.2.1) and (3.2.3) are standard assump-
tions in time series analysis, and a considerable variety of sufficient conditions
for the assumptions to hold are known. We just note that if f(λ), λ ∈ [−π, π],
is the spectral density of the process {Y_n}_{n=−∞}^{∞}, then

    2π f(λ) = σ_ε² |ψ(e^{−iλ})|²   (3.2.4)

(Brockwell and Davis, 1991, eq. 4.4.3; Shiryayev, 1984, VI §6 eq. (16f.)), and

    ∫_{−π}^{π} log f(λ) dλ > −∞   (3.2.5)

is sufficient (and in fact also necessary) for the process to be purely nondeter-
ministic (Shiryayev, 1984, VI §5 Theorem 4). The simplest setting in which
(3.2.5) holds is the case when f happens to be bounded away from 0. Then
the process is strongly mixing (Ibragimov and Linnik, 1971, Theorem 17.3.3).
However, for the purpose of our analysis, we do not require this strong property.
We divide the problem of estimating E[Y_{n+1} | Y_n, Y_{n−1}, ...] into two steps:

1. Approximation of E[Y_{n+1} | Y_n, Y_{n−1}, ...] by some conditional expectation
   based on a fraction of the past only, say E[Y_{n+1} | Y_n, ..., Y_{n−d+1}] with d ∈ ℕ.

2. Estimation of the latter quantity from the observed data.

Here and in all the following, d should be taken as d = d_n ∈ ℕ with d_n ↗ ∞ and
d_n ≤ n, rather than as a mere constant. For the sake of simplicity of notation,
however, this will be suppressed most of the time.

3.2.1 An approximation result


As to the first step, the approximation step, note that by stationarity and
Doob’s conditional expectation continuity theorem (Doob, 1984, 2.I.5)

    E[Y_{n+1} | Y_n, Y_{n−1}, ...] − E[Y_{n+1} | Y_n, ..., Y_{n−d_n+1}] → 0

in L1 whenever dn → ∞. As will be seen, more stringent conditions are needed


to obtain convergence with probability 1. Similar problems arise in the context
of on-line order selection for AR(∞) models. Here, founded on Rissanen’s


(1989) stochastic complexity for model comparison, the influence of increasing
dimensionality dn on the prediction error of the estimated “best” AR(dn ) model
is discussed in depth by Gerencsér (1992). The accuracy of order selection
schemes based on least squares principles is also investigated in Davisson (1965)
and Wax (1988). In contrast to these, the approximation of E[Yn+1|Yn, Yn−1 , ...]
by E[Yn+1|Yn , ..., Yn−dn+1 ] used in this section is not data-driven but chooses
deterministic dn according to the conditions given in the following lemma.

Lemma 3.2.1. If the Taylor coefficients of

    1/ψ(z) = Σ_{k=0}^{∞} φ_k z^k   (|z| < 1)

satisfy

    Σ_{k=d_n+1}^{∞} |φ_k|² ≤ (const./log n)^r   (3.2.6)

for some r > 1 and sufficiently large n, then

    lim_{n→∞} | E[Y_{n+1} | Y_n, ..., Y_{n−d_n+1}] − E[Y_{n+1} | Y_n, Y_{n−1}, ...] | = 0

with probability 1.

The proof of this result and of the next theorems is deferred to section 3.3.

3.2.2 An estimation algorithm


If we collect the autocovariances in the matrix

    Γ_d := (γ(i − j))_{i,j=1,...,d} =
        ⎛ γ(0)     γ(1)     ···  γ(d−1) ⎞
        ⎜ γ(1)     γ(0)     ···  γ(d−2) ⎟
        ⎜   ⋮        ⋮       ⋱      ⋮   ⎟
        ⎝ γ(d−1)   γ(d−2)   ···  γ(0)   ⎠

and

    γ_d := (γ(d), ..., γ(1)),
we can obtain an explicit formula for E[Y_{n+1} | Y_n, ..., Y_{n−d+1}]. In fact, assumption
(3.2.3) implies

    γ(k) → 0   (k → ∞)   (3.2.7)

(Brockwell and Davis, 1991, Probl. 3.9), and from (3.2.1) and (3.2.7) it follows
that Γ_d is non-singular (Brockwell and Davis, 1991, Prop. 5.1.1). For Gaussian
processes one has (Brockwell and Davis, 1991, §5.4; Shiryayev, 1984, II §13
Theorem 2)

    E[Y_{d+1} | Y_d, ..., Y_1] = γ_d Γ_d^{−1} (Y_1, ..., Y_d)^T,

and the autoregression function

    m_d(y_d, ..., y_1) := E[Y_{d+1} | Y_d = y_d, ..., Y_1 = y_1] = γ_d Γ_d^{−1} (y_1, ..., y_d)^T   (3.2.8)

is linear. Stationarity yields

    m_d(y_d, ..., y_1) = E[Y_{n+1} | Y_n = y_d, ..., Y_{n−d+1} = y_1]

and thus

    E[Y_{n+1} | Y_n, ..., Y_{n−d+1}] = m_d(Y_n, ..., Y_{n−d+1}).   (3.2.9)

From (3.2.8) and (3.2.9) it is plausible to construct a simple estimator Ê_{d,n} for
the conditional expectation E[Y_{n+1} | Y_n, ..., Y_{n−d+1}] by the following steps:

1. Estimate the autocovariances γ(0), ..., γ(d) by the sample autocovariances

       γ̂_n(k) := (1/n) Σ_{i=1}^{n−|k|} Y_i Y_{i+|k|}   (k = −d, ..., d).   (3.2.10)

2. Set Γ̃_{d,n} := (γ̂_n(i − j))_{i,j=1,...,d}. Writing Γ̃_{d,n} = (1/n) A Aᵀ with the matrix
   A ∈ ℝ^{d×2n} formed by the first d rows of the matrix

       ⎛ 0    ···  0    Y_1  Y_2  ···  Y_n ⎞
       ⎜ 0    ···  Y_1  Y_2  ···  Y_n  0   ⎟
       ⎜              ⋮                    ⎟   ∈ ℝ^{n×2n},
       ⎝ Y_1  Y_2  ···  Y_n  0    ···  0   ⎠

   it is obvious that Γ̃_{d,n} is non-negative definite. Thus

       Γ̂_{d,n} := Γ̃_{d,n} + (1/n) I_d = ( γ̂_n(i − j) + δ_{ij}/n )_{i,j=1,...,d}
   is non-singular. Hence, with (3.2.10) and γ̂_{n,d} := (γ̂_n(d), ..., γ̂_n(1)), define

       m̂_{d,n}(y_d, ..., y_1) := γ̂_{n,d} Γ̂_{d,n}^{−1} (T_{L_n} y_1, ..., T_{L_n} y_d)^T

   in analogy to (3.2.8). Here, 0 < L_n ↗ ∞ is a sequence of truncation
   heights, with T_L y := sgn(y) min{L, |y|} the truncation operator.

3. Plug in the last d observations:

       Ê_{d,n} := m̂_{d,n}(Y_n, ..., Y_{n−d+1}).
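The three steps above can be sketched in a few lines of code. The following is a minimal illustration of ours, not part of the thesis: the function name `predict_next` and the use of NumPy are our own choices, and the deterministic schedules d = d_n, L = L_n are left to the caller.

```python
import numpy as np

def truncate(y, L):
    """Truncation operator T_L y := sgn(y) * min(L, |y|)."""
    return np.sign(y) * np.minimum(L, np.abs(y))

def predict_next(Y, d, L):
    """Estimator E_hat_{d,n} of E[Y_{n+1} | Y_n, ..., Y_{n-d+1}] from Y_1, ..., Y_n."""
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    # Step 1: sample autocovariances gamma_hat(0), ..., gamma_hat(d) as in (3.2.10).
    gam = np.array([Y[: n - k] @ Y[k:] / n for k in range(d + 1)])
    # Step 2: Gamma_hat = Gamma_tilde + (1/n) I_d is non-singular by construction.
    Gamma = np.array([[gam[abs(i - j)] for j in range(d)] for i in range(d)])
    Gamma += np.eye(d) / n
    gvec = gam[d:0:-1]                   # (gamma_hat(d), ..., gamma_hat(1))
    coef = np.linalg.solve(Gamma, gvec)  # Gamma_hat is symmetric, so this solves for the row vector
    # Step 3: plug in the truncated last d observations (oldest first).
    return float(coef @ truncate(Y[-d:], L))
```

For an AR(1) path with coefficient Φ, the output approaches Φ·Y_n, the true one-step autoregression.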

Remark. Even if γ̂_{n,d} Γ̂_{d,n}^{−1} constitutes a strongly consistent estimate of the
coefficients γ_d Γ_d^{−1} of the autoregression function m_d, estimation errors
(γ̂_{n,d} Γ̂_{d,n}^{−1} − γ_d Γ_d^{−1}) which are per se “acceptable” may occur together
with large values of the plugged-in Y_n, ..., Y_{n−d+1}. The resulting prediction
error |E[Y_{n+1} | Y_n, ..., Y_{n−d+1}] − Ê_{d,n}| can then become considerable. Suitable
truncation limits the size of the Y_i’s without obscuring the information they contain.
Now, denoting by ‖A‖_∞ := max_i Σ_j |a_{ij}| the maximal absolute row sum of a
real matrix A = (a_{ij})_{i,j}, we establish the following convergence result for the
proposed estimator:

Theorem 3.2.2. Assume (3.2.1) and (3.2.3), and choose d_n and L_n such that for
some r ≥ 4, some δ > 0 and sufficiently large n

    d_n ≤ n^{r/(2(r−2))},

    L_n ‖Γ_{d_n}^{−1}‖_∞² d_n^{2(r+1)/r} · (log n)^{2/r} (log log n)^{2(1+δ)/r} / n^{1/2} = O(1),

    L_n ‖Γ_{d_n}^{−1}‖_∞² d_n / n → 0   (n → ∞),

    Σ_{n=1}^{∞} (d_n/L_n) exp(−L_n²/(2σ²)) < ∞.   (3.2.11)

Then

    lim_{n→∞} | E[Y_{n+1} | Y_n, ..., Y_{n−d_n+1}] − Ê_{d_n,n} | = 0   (3.2.12)

with probability 1.

From (3.2.11), for the choice of d_n and L_n, one needs some bound on the possible
growth of ‖Γ_{d_n}^{−1}‖_∞. Based on the spectral density f and its (essential) minimum
m_f ≥ 0 we distinguish the following cases:
Case 1: f is bounded away from 0, m_f > 0.

Case 2: f has finitely many zeros λ_1, ..., λ_m ∈ (−π, π] of orders p_1, ..., p_m, that
is, there exist constants p_j^−, p_j^+ > 0, K ≥ 1, δ > 0 such that

    1/K < f(λ)/|λ − λ_j|^{p_j^+} < K   for all λ ∈ (λ_j, λ_j + δ)

and

    1/K < f(λ)/|λ − λ_j|^{p_j^−} < K   for all λ ∈ (λ_j − δ, λ_j).

In this case we define the order of the jth zero as p_j := max{p_j^−, p_j^+} and set
p* := max{p_j | j = 1, ..., m}.

Case 3: No restrictions are imposed on m_f, apart from those already implied
by (3.2.1) and (3.2.3).
For each of the cases, upper bounds for ‖Γ_{d_n}^{−1}‖_∞ can easily be derived from
classical results and from results recently obtained by Serra (1998, 1999, 2000) and
Böttcher and Grudsky (1998). These yield

Corollary 3.2.3. In cases 1–3 and under the assumptions (3.2.1) and (3.2.3),
the strong consistency relation (3.2.12) holds if, for n sufficiently large,

Case 1: d_n ≤ n^s and L_n := (log n)^t   (0 < s < 1/6, t ≥ 1).

Case 2: d_n ≤ n^s and L_n := (log n)^t   (0 < s < 1/(6 + 4p*), t ≥ 1).

Case 3: d_n ≤ ((1/q) log n)^s and L_n := (log n)^t   (q > 4, 0 < s < 1, t ≥ 1).

Before proving these results, we give some examples illustrating the application
of Lemma 3.2.1 and Corollary 3.2.3: for a suitable choice of d_n and L_n, the
consistency relation

    lim_{n→∞} | Ê_{d_n,n} − E[Y_{n+1} | Y_n, Y_{n−1}, ...] | = 0   (3.2.13)

holds with probability 1 for all processes in large classes G_i of Gaussian processes.
Example 3.1: First, let the class G_1 consist of all Gaussian processes satisfying
(3.2.1) and (3.2.3) whose spectral density is bounded away from zero. We choose

    d_n := ⌊n^s⌋   (0 < s < 1/4),   L_n := log n

and obtain (3.2.13) for any element of G_1.
Indeed, for every element of G_1, ψ(z) has no zeros for |z| = 1 by (3.2.4). Then
ψ(z) never vanishes in the closed unit disk, and 1/ψ(z) is analytic on a disk
around 0 with radius 1 + ε for some ε > 0. Thus φ_k (1 + ε/2)^k → 0 as k → ∞,
hence |φ_k| ≤ c (1 + ε/2)^{−k} with some constant c > 0. Set ρ := (1 + ε/2)^{−2} < 1;
then

    Σ_{k=d_n+1}^{∞} |φ_k|² ≤ c² Σ_{k=d_n+1}^{∞} ρ^k = c² ρ^{d_n+1}/(1 − ρ) ≤ (log n)^{−3}

for n sufficiently large if only d_n/log log n → ∞. Lemma 3.2.1 applies and
Corollary 3.2.3 yields (3.2.13).
For G1 and the choice dn = O(log n) it should be noted that (3.2.13) is also
a consequence of An, Chen and Hannan (1982, Theorem 6). From there it
follows that the Yule-Walker estimates (φ̂1 , ..., φ̂dn ) of the first dn coefficients in
the AR(∞) representation satisfy the uniform convergence property
    sup_{1≤j≤d_n} |φ̂_j − φ_j| = O( ((log log n)/n)^{1/2} )

with probability 1. Using the estimates (φ̂1 , ..., φ̂dn ) and the fact that the true
coefficients of the AR(∞) representation converge to zero at an exponential
rate, one obtains (3.2.13). However, the next example illustrates that Lemma
3.2.1 and Corollary 3.2.3 are applicable in more general situations as well.
Example 3.2: Consider the class G2 of all Gaussian processes satisfying (3.2.1)
and (3.2.3) such that for all elements of G2 the following two conditions hold:

a) The corresponding spectral density f has only finitely many zeros, each
   of which is of finite order in the sense of case 2 above, and

b) |φ_k| ≤ Φ_k, where {Φ_k}_k is an eventually decreasing sequence satisfying
   Σ_{k=1}^{∞} Φ_k^{2−ε} < ∞ for some ε > 0.
Note that the orders of the zeros as well as ε may be different for each element
of G_2. This class comprises G_1 as well as Gaussian processes with transfer
functions such as

    ψ(z) = (1 − z)^{1/3} = 1 + Σ_{k=1}^{∞} ψ_k z^k,   ψ_k = − (2·5·…·(3k−4))/(3·6·…·(3k)),

for which

    φ_k = (1·4·…·(3k−2))/(3·6·…·(3k))   and   f(λ) = (σ_ε²/2π) (4 sin²(λ/2))^{1/3},

the process being purely nondeterministic by (3.2.5).
With the choice

    d_n := ⌊(log n)^{log log n}⌋   (for n ≥ 2)   and   L_n := log n,

(3.2.13) holds for any process in G2 .


To see this, observe that

    Σ_{k=d_n+1}^{∞} φ_k² ≤ Φ_{d_n+1}^{ε} Σ_{k=d_n+1}^{∞} Φ_k^{2−ε} ≤ Φ_{d_n+1}^{ε} Σ_{k=1}^{∞} Φ_k^{2−ε} = const. · Φ_{d_n+1}^{ε}.

From Σ_{k=1}^{∞} Φ_k^{2−ε} < ∞, Olivier’s theorem (Knopp, 1956, 3.3 Theorem 1) allows us
to infer k Φ_k^{2−ε} → 0 (k → ∞), hence

    Σ_{k=d_n+1}^{∞} φ_k² ≤ const./(d_n + 1)^{ε/2}

for n sufficiently large. Thus (3.2.6) is fulfilled and Lemma 3.2.1 applies. On
the other hand, lim_{n→∞} (log n)^{log log n} n^{−s} = 0 for any s > 0, and we can use
Corollary 3.2.3 to obtain (3.2.13).
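The growth comparison used in the final step can be made explicit (a short verification of ours):

```latex
(\log n)^{\log\log n} = e^{(\log\log n)^2},
\qquad
\frac{(\log n)^{\log\log n}}{n^{s}}
  = \exp\!\big((\log\log n)^2 - s\log n\big) \xrightarrow[n\to\infty]{} 0,
```

since (log log n)² = o(log n). Hence this choice of d_n eventually satisfies every polynomial bound d_n ≤ n^s of Corollary 3.2.3, while still growing fast enough for Lemma 3.2.1.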

3.3 Proof of the approximation and estimation results

We first turn to the
Proof of Lemma 3.2.1. Note that the transfer function ψ(z) never vanishes
for |z| < 1 (Hida and Hitsuda, 1993, III §3 i). Hence its reciprocal is analytic
for |z| < 1 and can be written as

    φ(z) = 1/ψ(z) = Σ_{k=0}^{∞} φ_k z^k   (|z| < 1)

with φ_0 = 1. The innovations can be obtained by

    ε_{n+1} = Σ_{k=0}^{∞} φ_k Y_{n+1−k} = Y_{n+1} + Σ_{k=1}^{∞} φ_k Y_{n+1−k}

(Hida and Hitsuda, 1993, III §3 eq. 3.31). We observe that σ(ε_{n+1}) is
independent of σ(Y_n, Y_{n−1}, ...), and the latter σ-algebra makes Σ_{k=1}^{∞} φ_k Y_{n+1−k}
measurable. Thus,

    0 = E[ε_{n+1} | Y_n, Y_{n−1}, ...] = E[Y_{n+1} | Y_n, Y_{n−1}, ...] + Σ_{k=1}^{∞} φ_k Y_{n+1−k}

and

    E[Y_{n+1} | Y_n, Y_{n−1}, ...] = − Σ_{k=1}^{∞} φ_k Y_{n+1−k}.   (3.3.1)

Moreover,

    E[Y_{n+1} | Y_n, ..., Y_{n−d+1}] = E[ E[Y_{n+1} | Y_n, Y_{n−1}, ...] | Y_n, ..., Y_{n−d+1} ]
        = − Σ_{k=1}^{d} φ_k Y_{n+1−k} − E[ Σ_{k=d+1}^{∞} φ_k Y_{n+1−k} | Y_n, ..., Y_{n−d+1} ].   (3.3.2)

(3.3.1) and (3.3.2) imply

    E( E[Y_{n+1} | Y_n, Y_{n−1}, ...] − E[Y_{n+1} | Y_n, ..., Y_{n−d+1}] )²
        = E( E[ Σ_{k=d+1}^{∞} φ_k Y_{n+1−k} | Y_n, ..., Y_{n−d+1} ] − Σ_{k=d+1}^{∞} φ_k Y_{n+1−k} )²
        = E( Σ_{k=d+1}^{∞} φ_k Y_{n+1−k} )² − E( E[ Σ_{k=d+1}^{∞} φ_k Y_{n+1−k} | Y_n, ..., Y_{n−d+1} ] )²
        ≤ E( Σ_{k=d+1}^{∞} φ_k Y_{n+1−k} )².

Now, set

    H_d(λ) := Σ_{k=d+1}^{∞} φ_k exp(−ikλ).
Then |H_d(λ)|² f(λ) is the spectral density of the linear filter Σ_{k=d+1}^{∞} φ_k Y_{n+1−k}
(Brockwell and Davis, 1991, Theorem 4.10.1), and we obtain

    E( Σ_{k=d+1}^{∞} φ_k Y_{n+1−k} )² = ∫_{−π}^{π} |H_d(λ)|² f(λ) dλ
        ≤ sup_{λ∈[−π,π]} f(λ) ∫_{−π}^{π} |H_d(λ)|² dλ = 2π sup_{λ∈[−π,π]} f(λ) Σ_{k=d+1}^{∞} |φ_k|².

Since the difference E[Yn+1|Yn, Yn−1 , ...]−E[Yn+1|Yn , ..., Yn−d+1] is the probability
limit of the Gaussian variables E[Yn+1 |Yn, ..., Yn−k+1] − E[Yn+1 |Yn, ..., Yn−d+1] as
k → ∞, it is itself Gaussian (Shiryayev, 1984, II §13.5).
We will apply the following lemma on the convergence of Gaussian random
variables, which can be found in Buldygin and Dočenko (1977, Lemma 3).

Lemma 3.3.1. Let {W_n}_{n=0}^{∞} be a sequence of centered Gaussian random
variables W_n ∼ N(0, σ_n²) with σ_n² → 0 (n → ∞). If for every ε > 0

    Σ_{n=1}^{∞} exp(−ε/σ_n²) < ∞   (3.3.3)

then W_n → 0 with probability 1 as n → ∞.

In particular, (3.3.3) is fulfilled if

    σ_n² ≤ (const./log n)^r

for some r > 1 and sufficiently large n.


This follows immediately from Lemma 3.3.1 if we choose 1 < r′ < r and observe
that

    Σ_{n=N}^{∞} exp(−ε/σ_n²) ≤ Σ_{n=N}^{∞} exp(−ε (log n/const.)^r) ≤ Σ_{n=N}^{∞} exp(−(log n)^{r′})
        ≤ Σ_{n=N}^{∞} (exp(−log n))^{r′} = Σ_{n=N}^{∞} 1/n^{r′} < ∞

for N sufficiently large.
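Lemma 3.3.1 can be illustrated numerically along one simulated path (this simulation is ours, not part of the thesis): with the logarithmic variance decay σ_n² = (1/log n)², which satisfies the sufficient condition with r = 2 > 1, the late terms of an independent Gaussian sequence are uniformly small.

```python
import math
import random

random.seed(1)

# Independent W_n ~ N(0, sigma_n^2) with sigma_n = 1/log n, i.e. r = 2 > 1
# in the sufficient condition sigma_n^2 <= (const/log n)^r of Lemma 3.3.1.
sample = {n: random.gauss(0.0, 1.0 / math.log(n)) for n in range(1000, 100001, 1000)}

# Along this path the tail of the sequence stays uniformly small,
# in line with W_n -> 0 with probability 1.
print(max(abs(w) for w in sample.values()))
```

Every standard deviation here is below 1/log(1000) ≈ 0.145, so the printed maximum is small; by contrast, the remark below shows that a decay as slow as 1/lb n destroys almost sure convergence.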


Now, by Lemma 3.3.1 and with M_f := 2π sup_{λ∈[−π,π]} f(λ), the inequality

    Var( E[Y_{n+1} | Y_n, Y_{n−1}, ...] − E[Y_{n+1} | Y_n, ..., Y_{n−d_n+1}] )
        ≤ M_f Σ_{k=d_n+1}^{∞} |φ_k|² ≤ (const./log n)^r

for some r > 1 and n sufficiently large implies

    | E[Y_{n+1} | Y_n, Y_{n−1}, ...] − E[Y_{n+1} | Y_n, ..., Y_{n−d_n+1}] | → 0

with probability 1, and the proof of the lemma is finished. □


Remark. In order to obtain convergence of the W_n to zero with probability 1,
one has to impose some condition on the rate of decay of σ_n². The conditions
of Lemma 3.3.1 cannot be substantially weakened. Indeed, consider any sequence
of independent random variables W_n ∼ N(0, σ_n²) with σ_n² = (lb n)^{−1}, lb denoting
the logarithm to base 2; such a sequence cannot converge to zero with probability 1.
In fact, the lower bound on 1 − Φ in Feller (1968, VII.1. Lemma 2) gives

    P(|W_n| ≥ 1/2) > ( 4/√(lb n) − 16/(lb n)^{3/2} ) (1/√(2π)) exp(−(1/8) lb n) =: q_n.

The q_n constitute an eventually positive, decreasing sequence. Thus, Σ q_n has
the same convergence properties as Σ 2^n q_{2^n}. The latter series diverges because
of

    (2^n q_{2^n})^{1/n} = 2 ( (4/√n − 16/n^{3/2}) (2π)^{−1/2} )^{1/n} exp(−1/8) → 2 exp(−1/8) > 1.

Keeping this in mind, assume we had W_n → 0 with probability 1. With the
characteristic function f(y) := 1_{[−1/2,1/2]^C}(y), this yields independent random
variables f(W_n) → f(0) = 0 with probability 1. Since f(W_n) is {0, 1}-valued,
convergence to zero on a set of probability 1 implies (Shiryayev, 1984, II §10
Example 3)

    Σ_{n=1}^{∞} P(f(W_n) = 1) = Σ_{n=1}^{∞} P(|W_n| ≥ 1/2) < ∞,

a contradiction to the divergence of Σ q_n.
For the proof of Theorem 3.2.2 we need some preliminary observations. First
we recall some simple facts from the theory of matrix norms. Let ‖·‖ be some
vector norm on ℝ^d. By ‖·‖ we also denote the corresponding matrix norm

    ‖A‖ := sup_{y≠0} ‖Ay‖/‖y‖ = sup_{‖y‖=1} ‖Ay‖

for A ∈ ℝ^{d×d}. The spectrum of A is the collection of the moduli of the
eigenvalues of A,

    spr(A) := { |λ| : λ eigenvalue of A },

and the spectral radius of A is

    ρ(A) := max spr(A).

Recall the following inequalities (Isaacson and Keller, 1994, Corollaries in Sec.
1.1 and 1.3):

Lemma 3.3.2. Let A, B, C ∈ ℝ^{d×d}.

1. If ‖A‖ < 1, then (I − A)^{−1} exists and

       ‖(I − A)^{−1}‖ ≤ 1/(1 − ‖A‖).

2. If B and C are non-singular with ‖I − B^{−1}C‖ < 1, the following inequalities
hold:

       ‖C^{−1}‖ ≤ ‖B^{−1}‖/(1 − ‖I − B^{−1}C‖),

       ‖C^{−1} − B^{−1}‖ ≤ ‖B^{−1}‖ ‖I − B^{−1}C‖/(1 − ‖I − B^{−1}C‖).

Proof. 1. Existence of (I − A)^{−1} follows from the well-known Neumann series.
(I − A)^{−1} = I + A(I − A)^{−1} implies ‖(I − A)^{−1}‖ ≤ 1 + ‖A‖ ‖(I − A)^{−1}‖, or,
using ‖A‖ < 1,

    ‖(I − A)^{−1}‖ ≤ 1/(1 − ‖A‖).

2. Inversion of B^{−1}C = I + (B^{−1}C − I) yields C^{−1}B = (I − (I − B^{−1}C))^{−1}.
Multiplying by B^{−1} from the right, we obtain C^{−1} = (I − (I − B^{−1}C))^{−1} B^{−1}
and, using the first part of the lemma,

    ‖C^{−1}‖ ≤ ‖B^{−1}‖ ‖(I − (I − B^{−1}C))^{−1}‖ ≤ ‖B^{−1}‖/(1 − ‖I − B^{−1}C‖).

Finally, write C^{−1} − B^{−1} = (I − B^{−1}C)C^{−1} and obtain

    ‖C^{−1} − B^{−1}‖ ≤ ‖I − B^{−1}C‖ ‖C^{−1}‖ ≤ ‖B^{−1}‖ ‖I − B^{−1}C‖/(1 − ‖I − B^{−1}C‖). □
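The two perturbation bounds of Lemma 3.3.2 are easy to check numerically. The sanity check below is our own (not from the thesis), using NumPy and the ∞-norm that the proof of Theorem 3.2.2 works with; B and the small perturbation C are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
B = np.eye(d) + 0.1 * rng.standard_normal((d, d))   # non-singular
C = B + 0.01 * rng.standard_normal((d, d))          # small perturbation of B

op = lambda M: np.linalg.norm(M, np.inf)            # maximal absolute row sum
Bi, Ci = np.linalg.inv(B), np.linalg.inv(C)
r = op(np.eye(d) - Bi @ C)

assert r < 1                                        # hypothesis of part 2
assert op(Ci) <= op(Bi) / (1 - r) + 1e-10           # first inequality
assert op(Ci - Bi) <= op(Bi) * r / (1 - r) + 1e-10  # second inequality
```

This is exactly the way the lemma is used below: Γ_d plays the role of B, the estimated matrix Γ̂_{d,n} the role of C, and r = ‖I − Γ_d^{−1}Γ̂_{d,n}‖_∞ is driven to 0.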

We will use the vector norms ‖y‖_2 := (Σ_{i=1}^{d} y_i²)^{1/2} and ‖y‖_∞ := max_{i=1,...,d} |y_i|,
and the corresponding matrix norms ‖A‖_2 = ρ(AᵀA)^{1/2} and ‖A‖_∞ =
max_{i=1,...,d} Σ_{j=1}^{d} |a_{ij}|, respectively. For the latter norm we have

Lemma 3.3.3. For any symmetric and non-singular matrix A ∈ ℝ^{d×d},

    ‖A‖_∞ ≤ d^{1/2}/min spr(A^{−1}).

Proof. Let y = (y_1, ..., y_d) ∈ ℝ^d. Then ‖y‖_∞ ≤ ‖y‖_2 ≤ (d max_i y_i²)^{1/2} =
d^{1/2} ‖y‖_∞ and

    ‖A‖_∞ = sup_{y≠0} ‖Ay‖_∞/‖y‖_∞ ≤ d^{1/2} sup_{y≠0} ‖Ay‖_2/‖y‖_2 = d^{1/2} ‖A‖_2 = (d ρ(AᵀA))^{1/2}
        = (d ρ(A²))^{1/2} = (d ρ(A)²)^{1/2} = d^{1/2} ρ(A) = d^{1/2}/min spr(A^{−1}). □

The proposed estimation procedure requires convergence of γ̂_n(k) to γ(k). Doob
(1953, X §7 Theorem 7.1) gave the key result for this when proving that, for
a real-valued, centered, stationary and ergodic Gaussian process, γ̂_n(k) → γ(k)
with probability 1 as n → ∞ is equivalent to either one of the conditions (a)
(1/(n+1)) Σ_{k=0}^{n} |γ(k)|² → 0 as n → ∞ or (b) the spectral distribution function F is
continuous. In the present analysis, a more precise result featuring the rate of
convergence will be needed. This important result was given by An, Chen and
Hannan (1982, Theorem 1):

Lemma 3.3.4. Let {Y_n}_{n=1}^{∞} be a stationary ergodic process with zero mean
and finite variance, allowing for a representation

    Y_n = Σ_{j=0}^{∞} ψ_j ε_{n−j}   with   ψ_0 = 1,   Σ_{j=0}^{∞} |ψ_j| < ∞

and innovations ε_n satisfying

    E[ε_n | ε_m, m < n] = 0,   E[ε_n² | ε_m, m < n] = σ_ε²,   E(|ε_n|^r) < ∞ for some r ≥ 4.

Then for any δ > 0 and d_n ≤ n^{r/(2(r−2))}

    lim_{n→∞} r_n max_{0≤k≤d_n} |γ(k) − γ̂_n(k)| = 0

with probability 1, where

    r_n := n^{1/2} / ( (d_n log n)^{2/r} (log log n)^{2(1+δ)/r} ).

In the “Gaussian case” considered here, the conditions on the innovations are
fulfilled, since the ε_n’s are independent and identically N(0, σ_ε²)-distributed, hence
E[ε_n | ε_m, m < n] = Eε_n = 0 and E[ε_n² | ε_m, m < n] = E(ε_n²) = σ_ε².
After these introductory remarks, we turn to the core of the
Proof of Theorem 3.2.2. Ȳ_{n−d+1}^{n} := (Y_{n−d+1}, ..., Y_n)^T is a shorthand notation
for the d-past of the process, and we set T_L Ȳ_{n−d+1}^{n} := (T_L Y_{n−d+1}, ..., T_L Y_n)^T.
The prediction error can be decomposed into

    | Ê_{d,n} − E[Y_{n+1} | Y_n, ..., Y_{n−d+1}] | = | γ̂_{d,n} Γ̂_{d,n}^{−1} T_{L_n} Ȳ_{n−d+1}^{n} − γ_d Γ_d^{−1} Ȳ_{n−d+1}^{n} |
        ≤ | (γ̂_{d,n} Γ̂_{d,n}^{−1} − γ_d Γ_d^{−1}) T_{L_n} Ȳ_{n−d+1}^{n} | + | γ_d Γ_d^{−1} (T_{L_n} Ȳ_{n−d+1}^{n} − Ȳ_{n−d+1}^{n}) |
        ≤ d L_n ‖γ_d Γ_d^{−1} − γ̂_{d,n} Γ̂_{d,n}^{−1}‖_∞ + d ‖γ_d‖_∞ ‖Γ_d^{−1}‖_∞ ‖T_{L_n} Ȳ_{n−d+1}^{n} − Ȳ_{n−d+1}^{n}‖_∞.   (3.3.4)

Convergence of the first term in (3.3.4): Observe that

    ‖I − Γ_d^{−1} Γ̂_{d,n}‖_∞ = ‖Γ_d^{−1}(Γ_d − Γ̂_{d,n})‖_∞ ≤ ‖Γ_d^{−1}‖_∞ ‖Γ_d − Γ̂_{d,n}‖_∞
        = ‖Γ_d^{−1}‖_∞ max_{i=1,...,d} Σ_{k=1}^{d} | γ(i − k) − γ̂_n(i − k) − δ_{ik}/n |
        ≤ ‖Γ_d^{−1}‖_∞ max_{i=1,...,d} ( Σ_{k=1}^{d} |γ(i − k) − γ̂_n(i − k)| + 1/n )
        ≤ ‖Γ_d^{−1}‖_∞ ( d max_{i=0,...,d−1} |γ(i) − γ̂_n(i)| + 1/n ).   (3.3.5)
From the An, Chen and Hannan (1982) result (Lemma 3.3.4), this tends to 0
with probability 1, if only (recall d = d_n ≤ n^{r/(2(r−2))})

    ‖Γ_{d_n}^{−1}‖_∞ d_n / r_n = O(1)   (3.3.6)

and

    ‖Γ_{d_n}^{−1}‖_∞ / n → 0   (n → ∞).   (3.3.7)
Thus, for all ω ∈ Ω from a set of probability 1 and for n sufficiently large,

    ‖I − Γ_d^{−1} Γ̂_{d,n}‖_∞(ω) < 1,

and by the second part of Lemma 3.3.2

    ‖Γ̂_{d,n}^{−1}‖_∞ ≤ ‖Γ_d^{−1}‖_∞ / (1 − ‖I − Γ_d^{−1} Γ̂_{d,n}‖_∞),

    ‖Γ̂_{d,n}^{−1} − Γ_d^{−1}‖_∞ ≤ ‖Γ_d^{−1}‖_∞ ‖I − Γ_d^{−1} Γ̂_{d,n}‖_∞ / (1 − ‖I − Γ_d^{−1} Γ̂_{d,n}‖_∞).
Using these inequalities, it follows that

    ‖γ_d Γ_d^{−1} − γ̂_{d,n} Γ̂_{d,n}^{−1}‖_∞ ≤ ‖γ_d Γ_d^{−1} − γ_d Γ̂_{d,n}^{−1}‖_∞ + ‖γ_d Γ̂_{d,n}^{−1} − γ̂_{d,n} Γ̂_{d,n}^{−1}‖_∞
        ≤ ‖γ_d‖_∞ ‖Γ̂_{d,n}^{−1} − Γ_d^{−1}‖_∞ + ‖γ_d − γ̂_{d,n}‖_∞ ‖Γ̂_{d,n}^{−1}‖_∞
        ≤ ‖γ_d‖_∞ ‖Γ_d^{−1}‖_∞ ‖I − Γ_d^{−1} Γ̂_{d,n}‖_∞ / (1 − ‖I − Γ_d^{−1} Γ̂_{d,n}‖_∞)
          + ‖γ_d − γ̂_{d,n}‖_∞ ‖Γ_d^{−1}‖_∞ / (1 − ‖I − Γ_d^{−1} Γ̂_{d,n}‖_∞).

For any second order stationary process there exists a constant c_γ with

    0 ≤ |γ(k)| ≤ c_γ < ∞   (k ∈ ℕ_0),

hence ‖γ_d‖_∞ ≤ c_γ (this can also be seen from (3.2.7)). We set M_{d,n} :=
max_{i=0,...,d} |γ(i) − γ̂_n(i)|. Now, appealing to (3.3.5) again and tidying things
up, we obtain

    ‖γ_d Γ_d^{−1} − γ̂_{d,n} Γ̂_{d,n}^{−1}‖_∞
        ≤ ( ‖Γ_d^{−1}‖_∞ (c_γ ‖Γ_d^{−1}‖_∞ d + 1) M_{d,n} + ‖Γ_d^{−1}‖_∞² c_γ / n )
          / ( 1 − ( ‖Γ_d^{−1}‖_∞ d M_{d,n} + ‖Γ_d^{−1}‖_∞ / n ) )

(for all ω ∈ Ω from a set of probability 1 and for n ≥ N(ω)). Finally, according
to Lemma 3.3.4, d L_n ‖γ_d Γ_d^{−1} − γ̂_{d,n} Γ̂_{d,n}^{−1}‖_∞ → 0 with probability 1 if

    L_n ‖Γ_{d_n}^{−1}‖_∞² d_n² / r_n = O(1),   (3.3.8)

    L_n ‖Γ_{d_n}^{−1}‖_∞² d_n / n → 0.   (3.3.9)
Convergence of the second term in (3.3.4): d ‖γ_d‖_∞ ‖Γ_d^{−1}‖_∞ ‖T_{L_n} Ȳ_{n−d+1}^{n} − Ȳ_{n−d+1}^{n}‖_∞
is readily bounded from above by d c_γ ‖Γ_d^{−1}‖_∞ ‖T_{L_n} Ȳ_{n−d+1}^{n} − Ȳ_{n−d+1}^{n}‖_∞, so it
suffices to ensure that (again d = d_n)

    d ‖Γ_d^{−1}‖_∞ ‖T_{L_n} Ȳ_{n−d+1}^{n} − Ȳ_{n−d+1}^{n}‖_∞ → 0

with probability 1 as n → ∞. To this end, for ε > 0,

    P( d ‖Γ_d^{−1}‖_∞ ‖T_{L_n} Ȳ_{n−d+1}^{n} − Ȳ_{n−d+1}^{n}‖_∞ ≥ ε )
        = P( d ‖Γ_d^{−1}‖_∞ max_{i=0,...,d−1} |T_{L_n} Y_{n−i} − Y_{n−i}| ≥ ε )
        ≤ P( ∃ i ∈ {0, ..., d−1} : |Y_{n−i}| ≥ L_n ) = P( max_{i=0,...,d−1} |Y_{n−i}| ≥ L_n ),

since |T_{L_n} Y_{n−i} − Y_{n−i}| > 0 implies |Y_{n−i}| ≥ L_n. But

    P( max_{i=0,...,d−1} |Y_{n−i}| ≥ L_n ) ≤ d P(|Y_1| ≥ L_n) = 2d (1 − Φ(L_n/σ)).

Here Φ denotes the distribution function of the standard normal distribution.
Application of the standard bound on Φ,

    1 − Φ(y) < (1/(√(2π) y)) exp(−y²/2)   (y > 0)

(Feller, 1968, VII.1. Lemma 2), yields

    P( d ‖Γ_d^{−1}‖_∞ ‖T_{L_n} Ȳ_{n−d+1}^{n} − Ȳ_{n−d+1}^{n}‖_∞ ≥ ε ) ≤ 2d (σ/(√(2π) L_n)) exp(−L_n²/(2σ²)).

From this,

    d ‖γ_d‖_∞ ‖Γ_d^{−1}‖_∞ ‖T_{L_n} Ȳ_{n−d+1}^{n} − Ȳ_{n−d+1}^{n}‖_∞ → 0

with probability 1, if only

    Σ_{n=1}^{∞} (d_n/L_n) exp(−L_n²/(2σ²)) < ∞.   (3.3.10)

Putting things together: Apart from d_n ≤ n^{r/(2(r−2))}, the conditions to be fulfilled
are (3.3.6)–(3.3.10). From (3.2.3) and (3.2.4), the spectral density f of the
process is continuous, which yields 0 < sup_{λ∈[−π,π]} f(λ) < ∞. With the standard
bound on the spectrum of Γ_d (Brockwell and Davis, 1991, Prop. 4.5.3), one has

    ‖Γ_d^{−1}‖_∞ ≥ ρ(Γ_d^{−1}) = 1/min spr(Γ_d) ≥ 1/(2π sup_{λ∈[−π,π]} f(λ)).
Hence (3.3.6) and (3.3.7) are implied by (3.3.8) and (3.3.9), respectively. Rewriting
(3.3.8) as

    L_n ‖Γ_{d_n}^{−1}‖_∞² d_n^{2(r+1)/r} · (log n)^{2/r} (log log n)^{2(1+δ)/r} / n^{1/2} = O(1),

we end up with the following four conditions:

    d_n ≤ n^{r/(2(r−2))},

    L_n ‖Γ_{d_n}^{−1}‖_∞² d_n^{2(r+1)/r} · (log n)^{2/r} (log log n)^{2(1+δ)/r} / n^{1/2} = O(1),

    L_n ‖Γ_{d_n}^{−1}‖_∞² d_n / n → 0   (n → ∞),

    Σ_{n=1}^{∞} (d_n/L_n) exp(−L_n²/(2σ²)) < ∞,

in order to obtain

    lim_{n→∞} | E[Y_{n+1} | Y_n, ..., Y_{n−d_n+1}] − Ê_{d_n,n} | = 0

with probability 1. □
Finally, we prove Corollary 3.2.3.

Proof of Corollary 3.2.3. Recently, Serra Capizzano obtained the following
result (1999, Theorem 3.2; 2000, Theorem 1.2) on the “worst” rate of decay of
the minimal eigenvalue µ_d^{min} of the Toeplitz matrices

    T_d(f) := (γ(i − j))_{i,j=1,...,d}

formed by the coefficients of the Fourier expansion

    f(λ) = (1/2π) Σ_{j=−∞}^{∞} γ(j) exp(−ijλ)   (λ ∈ [−π, π])

of some real-valued Lebesgue integrable function f. As to the situation of case
3, the following holds:

Lemma 3.3.5. Let f be a real-valued Lebesgue integrable function with
essential infimum m_f. Suppose there exists an interval (a, b) ⊆ (−π, π), a < b,
and a number δ > 0 such that f(λ) > δ for almost all λ ∈ (a, b). Then

    µ_d^{min} ≥ K exp(−cd) + m_f   (3.3.11)

with some c > 0 and some K > 0 independent of d.

The constants c and K are related to the measure of the set where f essentially
vanishes, which is not disclosed to the statistician. However, choosing some
0 < s < 1, one has

    K exp(−cd) + m_f ≥ exp(−d^{1/s})   (3.3.12)

for sufficiently large d. As already noted, from (3.2.3) and (3.2.4) the spectral
density f of the process under consideration is continuous. Thus the requirements
of Lemma 3.3.5 are met for T_d(f) = Γ_d, and Lemma 3.3.3 together with
(3.3.11) and (3.3.12) yields

    ‖Γ_d^{−1}‖_∞ ≤ d^{1/2}/µ_d^{min} ≤ d^{1/2} exp(d^{1/s}).

It remains to check the conditions (3.2.11) of Theorem 3.2.2 in the case of

    q > 4,   0 < s < 1,   t ≥ 1,   d_n ≤ ((1/q) log n)^s   and   L_n = (log n)^t.

The first inequality in (3.2.11) is obvious; the second and third follow from

    ‖Γ_{d_n}^{−1}‖_∞² d_n / n^{1/2} ≤ d_n² exp(2 d_n^{1/s}) / n^{1/2} ≤ d_n² n^{2/q − 1/2}

with 2/q − 1/2 < 0 and d_n growing only polylogarithmically. As to the fourth
inequality, observe that d_n/L_n → 0 as n → ∞ and, for n sufficiently large,
−L_n²/(2σ²) ≤ −2 log n, hence

    exp(−L_n²/(2σ²)) ≤ 1/n².

The corollary for the more restrictive cases 1 and 2 follows similarly, using

    µ_d^{min} ≥ 2π inf_{λ∈[−π,π]} f(λ) > 0

(Brockwell and Davis, 1991, Proposition 4.5.3) for case 1 and

    µ_d^{min} ≥ const./d^{p*}

(Böttcher and Grudsky, 1998, Example 3.1 and Theorem 3.4; Serra Capizzano,
2000, Remark 1.2; less general in Serra, 1998, Theorem 2.3) for case 2. □
3.4 Simulations and examples
We first continue Examples 3.1 and 3.2 from Section 3.2.
Example 3.1 (continued): Consider the AR(1) processes

    Y_t − Φ Y_{t−1} = Z_t,   {Z_t} white noise with variance σ²,

with |Φ| < 1 (Brockwell and Davis, 1991, Example 4.4.2). These processes have
the spectral densities

    f(λ) = (σ²/2π) (1 − 2Φ cos λ + Φ²)^{−1},

bounded away from zero. The autocovariance function is given by

    γ(0) = σ²/(1 − Φ²)   and   γ(k) = Φ^k γ(0)   (k ≥ 1),

from which we can calculate the conditional expectation E[Y_{n+1} | Y_n, ..., Y_1] of the
next output given the past using (3.2.9). The latter conditional expectation acts
as an approximation of the unknown true autoregression E[Y_{n+1} | Y_n, Y_{n−1}, ...].
Figure 3.1 a-b) shows two paths of the process (circles) for different values
of σ² and Φ, together with the corresponding autoregression (grey) and the
estimated autoregression (black). The convergence of the estimates towards
the true autoregression is clearly visible.
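For the AR(1) model this can be checked with a few lines of simulation (ours, not the thesis code): the ratio of the lag-one to the lag-zero sample autocovariance recovers Φ, and hence the true autoregression E[Y_{n+1} | Y_n, Y_{n−1}, ...] = Φ Y_n.

```python
import math
import random

random.seed(2)
Phi, sigma2, n = 0.3, 1.0, 10_000

# Simulate the AR(1) recursion Y_t = Phi * Y_{t-1} + Z_t, started in
# the stationary distribution N(0, sigma2 / (1 - Phi^2)).
Y = [random.gauss(0.0, math.sqrt(sigma2 / (1 - Phi**2)))]
for _ in range(n - 1):
    Y.append(Phi * Y[-1] + random.gauss(0.0, math.sqrt(sigma2)))

# Sample autocovariances as in (3.2.10); their ratio estimates Phi.
gamma0 = sum(y * y for y in Y) / n
gamma1 = sum(Y[i] * Y[i + 1] for i in range(n - 1)) / n
print(gamma1 / gamma0)  # close to Phi = 0.3
```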
Example 3.2 (continued): The process from Example 3.2 in Section 3.2 has
the spectral density

    f(λ) = (σ_ε²/2π) (4 sin²(λ/2))^{1/3}.

From this we calculate the autocovariances

    γ(k) = 2 ∫_0^π f(λ) cos(kλ) dλ

(Brockwell and Davis, 1991, eq. 4.3.10) using a compound trapezoidal integration
rule with an error of at most 10^{−7}. Figure 3.2 a) shows the autocovariance
function of the process for an innovation variance σ_ε² = 0.01. Again,
E[Y_{n+1} | Y_n, Y_{n−1}, ...] is approximated by E[Y_{n+1} | Y_n, ..., Y_1], which is calculated by
(3.2.9). The Gaussian process {Y_n}_n itself is simulated by the method described
in Brockwell and Davis (1991, Ex. 8.16, p. 271). Figure 3.2 b) shows the
realisations of Y_n (circles) together with the “true” (grey) and the predicted
(black) autoregression for 50 consecutive days. Again, the convergence result is
convincing.

Figure 3.1: “True” (grey) and predicted (black) autoregression for two Gaussian
AR(1) models. a) σ² = 1, Φ = −0.3; b) σ² = 1, Φ = 0.3.

Figure 3.2: The autocovariance function and a sample path of the process in
Example 3.2. a) The autocovariance function of the process; b) “true” (grey) and
predicted (black) autoregression.
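The autocovariance computation can be reproduced with a compound trapezoidal rule. The following re-implementation is ours (the grid size m is an arbitrary choice, not the thesis setting):

```python
import math

sigma2_eps = 0.01  # innovation variance, as in Figure 3.2

def f(lam):
    # spectral density f(lam) = (sigma_eps^2 / 2 pi) * (4 sin^2(lam/2))^(1/3)
    return (sigma2_eps / (2 * math.pi)) * (4 * math.sin(lam / 2) ** 2) ** (1 / 3)

def gamma(k, m=20000):
    # compound trapezoidal rule for gamma(k) = 2 * int_0^pi f(lam) cos(k*lam) d lam
    h = math.pi / m
    s = 0.5 * (f(0.0) + f(math.pi) * math.cos(k * math.pi))
    s += sum(f(j * h) * math.cos(k * j * h) for j in range(1, m))
    return 2 * h * s

print(gamma(0), gamma(1))  # gamma(0) > 0, gamma(1) < 0, as in Figure 3.2 a)
```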

The following two examples illustrate the performance of the greedy strategy
from Section 3.1.
Example 3.3: We run the greedy strategy from Section 3.1 in a market with
a stock whose price follows a geometrical Brownian motion (Luenberger, 1998,
Sec. 11.7; Korn and Korn, 1999, Ch. 2) with a mean return of 2% p.a. and
a volatility of σ = 10% p.a. The bond offers a riskless return of 2% p.a. (i.e.,
we set r = 2%/365). The algorithm of Section 3.2 is used to predict the
log-returns of the stock. Figure 3.3 shows the value of an investment of $1 either
solely in the stock (grey, solid) or in the bond (grey, dashed). The value of the
greedy strategy is shown by the solid black line. In times when the share price
is likely to plummet the investor takes refuge in the bond. Thus he participates

[Figure 3.3: Daily value of a $1 investment in a stock following a geometric Brownian motion with µ = 2%/365, σ = 10%/√365 (grey, solid), in a bond with short rate 2% p.a. (grey, dashed) and in a greedy strategy (black).]

[Figure 3.4 a): Yellow Corp. (YELL). Figure 3.4 b): SONY (SNE).]



[Figure 3.4 c): Boeing Co. (BA).]

Figure 3.4: Daily value of a $1 investment in some shares from Dow Jones
Indices at NYSE 24/4/1998-8/2/1999 (grey, solid), in a bond with short rate
2% p.a. (grey, dashed) or in a greedy strategy (black), respectively.

in the rise of the share price more than in its fall, increasing his annual yield beyond 2%. As in Section 2.2, this is the phenomenon of volatility pumping (Luenberger, 1998, Examples 15.2 and 15.3): the share’s volatility is used to draw an above-average return from the stock.
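The volatility-pumping effect can be isolated in a deterministic toy market; the numbers below are illustrative and are not those of Example 3.3. A stock that alternately doubles and halves leaves a buy-and-hold investor exactly where he started, while a portfolio rebalanced to 50% stock and 50% cash each period grows by the factor 1.5 · 0.75 = 1.125 per two-period cycle:

```python
def rebalanced_growth(returns, w=0.5):
    """Wealth of a portfolio rebalanced each period to a fraction w in the
    stock and 1 - w in cash (cash return 1)."""
    wealth = 1.0
    for x in returns:
        wealth *= (1 - w) * 1.0 + w * x
    return wealth

# A stock that alternately doubles and halves over 20 periods.
returns = [2.0, 0.5] * 10

hold = 1.0
for x in returns:
    hold *= x                        # buy-and-hold ends at exactly 1.0

pumped = rebalanced_growth(returns)  # grows by 1.5 * 0.75 = 1.125 per cycle
print(hold, round(pumped, 4))
```

After ten cycles the rebalanced portfolio is worth 1.125¹⁰ ≈ 3.25 dollars while buy-and-hold is back at $1; this is the mechanism the greedy strategy exploits in a milder, data-driven form.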
Example 3.4: We replace the geometric Brownian motion by various real
stock price processes on 200 days of trading, using the NYSE closing prices
24/4/1998-8/2/1999 (data from www.wallstreetcity.com). Figure 3.4 a-c) shows
the corresponding charts. Although the greedy strategy does not manage to yield a return at least as large as the bond’s return in all cases (Fig. 3.4 b), it typically outperforms the stock, considerably reducing the investor’s risk of financial loss from pure share investment.

CHAPTER 4

A Markov model with transaction costs: probabilistic view

In Chapter 1 (Theorem 1.3.4) we have seen that investment according to the
log-optimal portfolio is optimal in an asymptotic and a non-asymptotic sense.
In a market of m stocks with i.i.d. returns, for example, the log-optimal portfolio is a constant b∗ ∈ S. This, however, does not mean that once the investor has allocated his wealth according to b∗, he need not rebalance his portfolio in the following market periods. On the contrary, since the price of each stock evolves differently, the proportions of wealth held in the stocks will differ from b∗ already after the next market period. Thus, selling and purchasing stocks becomes necessary after essentially every market period. In a market where
transactions such as selling and buying generate transaction costs, one should
therefore adopt a more carefully chosen strategy, combining the task of maxi-
mizing the portfolio return with the task of having to pay as little transaction
fees as possible. In the setting of a Markov market model, this chapter gives an
optimal solution to this problem.
In Section 4.1 we will set up a market model with transaction costs and for-
mulate our investment goals. The chapter is devoted to probabilistic aspects
of the model, i.e. the analysis assumes the distribution of the return process
to be known. Statistical aspects, in particular what to do if the distribution
is not revealed to the investor, will be investigated in Chapter 5. In Section
4.2 we propose an optimal strategy (Theorem 4.2.1) whose optimality is proven
in the remainder of the section. Section 4.3 concludes the chapter with results
that will be needed in Chapter 5 when dealing with the statistical aspects of
the model.
Historically, the necessity of modelling investment problems including trans-
action fees arose at a time when most research dealt with continuous-time

models, and consequently, most of the work on transaction costs concerns the
continuous-time case. Not much emphasis was put on statistical discrete-time
strategy building. A detailed overview of literature can be found in Davis and
Norman (1990), Fleming (1999), Cadenillas (2000) or Bielecki and Pliska (2000).
However, many models and strategies involved considerable computational effort, which made it necessary to use approximation techniques (see e.g. Fitzpatrick and Fleming, 1991, and Atkinson et al., 1997). Nowadays, continuous-
time models are frequently approximated by time-discrete models (for example
in Bielecki, Hernandez and Pliska, 1999, to discretize the continuous-time model
in Bielecki and Pliska, 1999). This brought about a renaissance of discrete-time
modelling. Up to date surveys of strategy planning under transaction costs
in discrete-time markets can be found in Carassus and Jouini (2000) for a
Cox-Rubinstein type model, in Blum and Kalai (1999) for optimal constant-
rebalanced portfolios and in Bobryk and Stettner (1999) for a market with a
bond and one stock having i.i.d. returns.

4.1 Strategies in markets with transaction fees


In this section we set up a market with a bond and several stocks whose returns
form a d-stage Markov process. Markets of this kind arise when discretizing
markets driven by stochastic differential equations even beyond the famous
Black-Scholes model. Consider a market of m stocks and a risk-free bond (which
we will think of as a bank account). Again, Xi,j denotes the return of the jth
stock from time i − 1 to time i and r ≥ 0 is the interest rate of the bond.
The return process {X_i := (X_{i,1}, ..., X_{i,m})^T}_{i=−d+1}^∞ is assumed to be a d-stage Markov process with continuous autoregression, i.e., if we denote the last d observed returns by X̄_{i+1} = (X_{i−d+1}, ..., X_i), the process satisfies the following conditions:

V1: {X_i}_{i=−d+1}^∞ is a stationary [A, B]^m-valued stochastic process on a probability space (Ω, A, P) (0 < A < B < ∞ need not be known),

V2: E[h(b, X̄_{i+1}) | F_i] = E[h(b, X̄_{i+1}) | X̄_i]   P-a.s.,

V3: E[h(b, X̄_{i+1}) | X̄_i = x̄] is a continuous function of (b, x̄) ∈ S × [A, B]^{dm},

for all continuous functions h : S × [A, B]^{dm} → IR and all i. Thus, at time i − 1, the further evolution of the market depends only upon the sub-σ-algebra σ(X̄_i) of the total information field F_i := σ(X_{−d+1}, ..., X_{i−1}).
Following Blum and Kalai (1999) we assume that only purchasing shares in the
market generates transaction costs (brokerage fees, commission) proportional
to the total value of the transaction, i.e.

transaction costs = c · value of purchased shares

with a commission factor c ∈ [0, 1). Paying into and drawing money from
the risk-free account does not generate any fees. In case two commission fac-
tors apply for selling and purchasing shares, say csell and cpurchase, one may use
c := (csell + cpurchase)/(1 + cpurchase) as a compound commission factor applying to
purchases only. This approach never underestimates the capital reducing effect
of transaction costs. Indeed, with 1 unit of money in a stock one can purchase
(1 − csell )/(1 + cpurchase) = 1 − c value in another stock or pay 1 − csell ≥ 1 − c
units of money into the account. Conversely, 1 unit of money in the bond can
purchase 1/(1 + cpurchase) ≥ 1 − c value in a stock.
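The algebra behind the compound commission factor can be verified directly; the fee levels below are illustrative:

```python
def compound_commission(c_sell, c_purchase):
    """Compound factor c applied to purchases only, as in the text:
    c = (c_sell + c_purchase) / (1 + c_purchase)."""
    return (c_sell + c_purchase) / (1 + c_purchase)

c_sell, c_purchase = 0.003, 0.005        # illustrative fee levels
c = compound_commission(c_sell, c_purchase)

# 1 unit of money in a stock buys (1 - c_sell)/(1 + c_purchase) value in
# another stock, which is exactly 1 - c:
stock_to_stock = (1 - c_sell) / (1 + c_purchase)
assert abs(stock_to_stock - (1 - c)) < 1e-12

# The compound factor never underestimates the capital-reducing effect:
assert 1 - c_sell >= 1 - c               # selling into the account
assert 1 / (1 + c_purchase) >= 1 - c     # buying out of the account
print(round(c, 6))
```

The identity 1 − c = (1 − c_sell)/(1 + c_purchase) follows by clearing the denominator, which is what the assertions check numerically.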
To see how investment actions are limited by transaction fees, consider a fixed
time instant i. Then the investor’s wealth Wi is used to acquire a new portfolio,
given by an enhanced portfolio vector bi+1 := (bi+1,−1 , ..., bi+1,m)T , which now is
(m + 2)-dimensional. Here

bi+1,−1 is the proportion of Wi needed to settle the transaction costs that arise
when the portfolio is restructured,

bi+1,0 is the proportion of Wi to be held in the bond and, as usual,

bi+1,j is the proportion of Wi to be held in stock j (j = 1, ..., m).

No short selling or consumption is considered, i.e. Σ_{j=−1}^m b_{i+1,j} = 1 and b_{i+1,j} ≥ 0. Thus, the portfolio vector b_{i+1} chosen at time i becomes a member of the simplex

    S := {b = (b_{−1}, ..., b_m)^T ∈ IR^{m+2} | b_j ≥ 0 for all j, Σ_{j=−1}^m b_j = 1}.

Now, in the market period from time i − 1 to time i the investor’s wealth Wi−1
generated a value of (1 + r)bi,0 Wi−1 in the bond and of Xi,j bi,j Wi−1 in the jth
stock. An amount of bi,−1 Wi−1 was used to settle transaction fees and is no

longer available. The resulting wealth at time i becomes W_i = (1 + r) b_{i,0} W_{i−1} + Σ_{j=1}^m X_{i,j} b_{i,j} W_{i−1}, or equivalently

    W_i / W_{i−1} = (1 + r) b_{i,0} + Σ_{j=1}^m X_{i,j} b_{i,j}.    (4.1.1)

Rebalancing the portfolio b_i to b_{i+1} generates transaction costs of total amount c Σ_{j=1}^m (b_{i+1,j} W_i − X_{i,j} b_{i,j} W_{i−1})^+, which are settled using the amount b_{i+1,−1} W_i. Hence the investor has to observe the self-financing condition

    b_{i+1,−1} W_i = c Σ_{j=1}^m (b_{i+1,j} W_i − X_{i,j} b_{i,j} W_{i−1})^+.    (4.1.2)

Using (4.1.1), (4.1.2) is equivalent to

    g_c(b_i, X̄_{i+1}, b_{i+1}) := b_{i+1,−1} ((1 + r) b_{i,0} + Σ_{k=1}^m X_{i,k} b_{i,k})
        − c Σ_{j=1}^m (b_{i+1,j} ((1 + r) b_{i,0} + Σ_{k=1}^m X_{i,k} b_{i,k}) − X_{i,j} b_{i,j})^+ = 0.    (4.1.3)

If x̄ is the matrix formed from the last d observed return vectors and s denotes
the last portfolio vector, we call the collection of all portfolios satisfying the
self-financing condition,

S(s, x̄) = {b ∈ S | gc (s, x̄, b) = 0}, (4.1.4)

the admissible set corresponding to (s, x̄) ∈ S × [A, B]^{dm}. Note that for all (s, x̄) ∈ S × [A, B]^{dm}

    a∗ := (0, 1, 0, ..., 0)^T ∈ S(s, x̄),    (4.1.4a)

i.e. there is always one option open to the investor: he can pay all his wealth into the risk-free account at any time.
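The self-financing function g_c of (4.1.3) is straightforward to implement. The following sketch uses hypothetical numbers, and x̄ enters only through the most recent return vector, as in (4.1.3); it also confirms that a∗ = (0, 1, 0, ..., 0)^T always satisfies the self-financing condition:

```python
def g_c(s, x, b, r=0.0, c=0.01):
    """Self-financing function (4.1.3).

    s : previous enhanced portfolio, laid out as (s_fee, s_bond, s_1, ..., s_m),
    x : most recent return vector (x_1, ..., x_m),
    b : candidate new enhanced portfolio in the same layout.
    """
    m = len(x)
    # W_i / W_{i-1} generated over the last market period, cf. (4.1.1):
    growth = (1 + r) * s[1] + sum(x[j] * s[2 + j] for j in range(m))
    # Proportional costs on purchases only, cf. (4.1.2):
    fees = c * sum(max(b[2 + j] * growth - x[j] * s[2 + j], 0.0) for j in range(m))
    return b[0] * growth - fees

# a* = (0, 1, 0, ..., 0): everything in the bond, no purchases, no fees.
m = 3
a_star = (0.0, 1.0) + (0.0,) * m
s = (0.0, 0.25, 0.25, 0.25, 0.25)   # some previous portfolio (illustrative)
x = (1.02, 0.97, 1.05)              # last return vector (illustrative)
assert g_c(s, x, a_star) == 0.0      # a* lies in every admissible set S(s, x)
```

A candidate b belongs to S(s, x̄) exactly when g_c(s, x̄, b) = 0, so this function is all that is needed to test admissibility of a proposed rebalancing.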
The investor can only follow non-anticipating portfolio strategies which comply
with the self-financing condition:

Definition 4.1.1. A sequence {b_i}_{i=0}^∞ of random variables Ω → S is called admissible portfolio strategy if for all i ∈ IN the following holds:

1. b_i ∈ S(b_{i−1}, X̄_i) P-a.s.,

2. b_i is F_i-measurable and

3. b_0 = a∗.

Condition 1 enforces that an admissible strategy never generates more transaction fees than the investor can afford. Because of condition 2, investment decisions require no more information than is currently available. Finally, condition 3 provides for a standardized setting in so far as the investor’s wealth is accumulated in the bond at the beginning of the investment process.
As we have seen, the pair (b_{i−1}, X̄_i) carries complete information about both the stochastic regime of the next market period (from the d-stage Markov property) and the admissible set (from the self-financing condition (4.1.3)). Hence, at time i − 1, the investment decision may be taken by applying a so-called portfolio selection function φ : S × [A, B]^{dm} → S to the pair (b_{i−1}, X̄_i). This approach wastes no information.

Definition 4.1.2. An admissible portfolio strategy {b_i}_{i=0}^∞ is based on the portfolio selection function φ : S × [A, B]^{dm} → S if φ is measurable and, for all i,

    b_i = φ(b_{i−1}, X̄_i)   P-a.s.

We now move on to defining the investment goal. From the previous chapters we know that the logarithmic utility function f : S × ([A, B]^m)^d → IR,

    f(b, X̄_{i+1}) = log((0, 1 + r, X_i^T) b),

is the optimal choice for long-run and short-term investment targets alike (note that the entry 0 in the first vector of the scalar product corresponds to the amount of transaction costs, which is lost). In this chapter we therefore assume that the investor aims to choose an admissible strategy {b_i}_i such that, in the long run, the expected mean utility E((1/n) Σ_{i=0}^{n−1} f(b_i, X̄_{i+1})) is larger than for any other strategy based on some portfolio selection function. This is formalized by inequality (4.2.4) below.
It should be pointed out that the process {X_i}_i need not contain mere return information, but may contain additional factors and side information in some of its coordinates as well, provided of course that these occur in the form of a d-stage Markov process with continuous autoregression function and that the joint vector of returns and factors satisfies V1, V2 and V3. The utility function f then simply ignores the coordinates containing the factors and side information.
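For concreteness, the utility function f of this section can be sketched in code; the portfolio and return values below are illustrative:

```python
import math

def log_utility(b, x, r=0.0):
    """f(b, X̄_{i+1}) = log((0, 1+r, x^T) b) for an enhanced portfolio
    b = (b_fee, b_bond, b_1, ..., b_m).  The fee coordinate b_fee is
    multiplied by 0 and hence lost, as noted in the text."""
    inner = (1 + r) * b[1] + sum(xj * bj for xj, bj in zip(x, b[2:]))
    return math.log(inner)

# Illustrative numbers: 1% of wealth lost to fees, 39% bond, two stocks.
b = (0.01, 0.39, 0.30, 0.30)
x = (1.05, 0.98)
u = log_utility(b, x, r=0.0002)
```

Because the fee proportion earns nothing, the argument of the logarithm here is less than the full gross return of the fee-free portfolio; paying transaction costs always lowers realized utility.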

4.2 An optimal strategy


For the rest of this chapter, assume that V1, V2 and V3 hold and that a (0, 1)-valued sequence of discount factors {δ_i}_{i=1}^∞ is fixed for which

    δ_i ↘ 0 monotonically as i → ∞,
    i δ_i → ∞ as i → ∞,
    δ_i piecewise constant: δ_1 = δ_2, δ_3 = δ_4 = δ_5, δ_6 = ... = δ_9, ...,
    (δ_i − δ_{i+1}) / δ_{i+1}² ≤ 1 for all i ≥ 1.    (4.2.1)

Then the following theorem gives an algorithm for optimal investment based on the so-called Bellman equation. Using stationarity, let

    m(b, x̄) := E[f(b, X̄_{i+1}) | X̄_i = x̄] = E[f(b, X̄_{d+1}) | X̄_d = x̄].

Theorem 4.2.1. Let h_i ∈ C(S × [A, B]^{dm}) be a solution of the Bellman equation

    h_i(s, x̄) = max_{b ∈ S(s, x̄)} { m(b, x̄) + (1 − δ_i) E[h_i(b, X̄_{d+1}) | X̄_d = x̄] }.    (4.2.2)

With V_i(b, x̄) := m(b, x̄) + (1 − δ_i) E[h_i(b, X̄_{d+1}) | X̄_d = x̄] we obtain an admissible portfolio strategy by

    b∗_0 := a∗,
    b∗_i := arg max_{b ∈ S(b∗_{i−1}, X̄_i)} V_i(b, X̄_i).    (4.2.3)

This strategy is optimal in the sense that for any portfolio strategy {b_i}_i based on a portfolio selection function

    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} E f(b∗_i, X̄_{i+1}) − (1/n) Σ_{i=0}^{n−1} E f(b_i, X̄_{i+1}) ) ≥ 0.    (4.2.4)

Before proving the theorem we make some remarks:



1. (4.2.4) compares the mean utility of the investment schemes {b∗_i}_i and {b_i}_i in the worst case (lim inf) that may occur for a remote time horizon. Hence, the result of Theorem 4.2.1 is a worst-case analysis. The Bellman equation (4.2.2) penalizes the “target” function m to obtain a target that is adjusted for losses through transaction costs. It thus balances the need to make as many transactions as necessary but, at the same time, as few as possible.

2. The strategy b∗_i is a generalization of the log-optimal strategy in Chapter 1. If no transaction costs occur, then c = 0 and S(s, x̄) = {(0, b_0, ..., b_m) | b_j ≥ 0, Σ_j b_j = 1} independently of s. Hence, h_i(s, x̄) = h_i(x̄), and b∗_i = arg max_{b ∈ S(b∗_{i−1}, X̄_i)} m(b, X̄_i) coincides with the classical log-optimal strategy.

3. In dynamic programming, a solution h_i of the Bellman equation (4.2.2) is frequently referred to as a value function. The existence of a value function for δ_i = 0 can be obtained under more restrictive extra conditions such as a finite state space of the Markov chain and certain recurrence properties of the transition matrix (Ross, 1970, Sec. 6.7 and Bertsekas, 1976, Sec. 8.1). To avoid these extra conditions we use a variant of the so-called vanishing discount approach (Hernández-Lerma and Lasserre, 1996, Sec. 5.3 - 5.5) where solutions of the Bellman equation (4.2.2) are produced for a sequence of discount factors δ_i → 0.

4. A sequence {δ_i}_i satisfying the above conditions can be obtained recursively by

    d_1 ∈ (0, 1) arbitrary,   d_{k+1} := (1/2)(√(1 + 4 d_k) − 1)

and

    δ_1 = δ_2 := d_1,
    δ_3 = δ_4 = δ_5 := d_2,
    δ_6 = δ_7 = δ_8 = δ_9 := d_3,
    δ_10 = ... .

Note that d_k converges monotonically decreasing to 0, that k(k + 1) d_k → ∞ (k → ∞) and that (d_k − d_{k+1}) / d_{k+1}² = 1.
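The recursion of remark 4 is easy to check numerically: d_{k+1} solves x² + x = d_k, whence (d_k − d_{k+1})/d_{k+1}² = 1 exactly. A short sketch (the starting value d_1 = 0.5 is arbitrary, as the remark allows):

```python
import math

def d_sequence(d1, K):
    """d_1 in (0, 1), d_{k+1} = (sqrt(1 + 4 d_k) - 1) / 2."""
    d = [d1]
    for _ in range(K - 1):
        d.append((math.sqrt(1 + 4 * d[-1]) - 1) / 2)
    return d

def delta_sequence(d):
    """Block k of the discount sequence has length k + 1 and repeats d_k:
    delta_1 = delta_2 = d_1, delta_3 = ... = delta_5 = d_2, and so on."""
    delta = []
    for k, dk in enumerate(d, start=1):
        delta.extend([dk] * (k + 1))
    return delta

d = d_sequence(0.5, 50)
delta = delta_sequence(d)
# d_k decreases to 0, and (d_k - d_{k+1}) / d_{k+1}^2 = 1 by construction,
# since d_{k+1} is the positive root of x^2 + x = d_k.
print(round(d[-1], 4), len(delta))
```

The piecewise-constant blocks of growing length are what make i·δ_i → ∞ possible even though δ_i ↘ 0.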

The Bellman equation is closely linked with the theory of Markov control pro-
cesses and stochastic dynamic programming (SDP). These have been applied to
financial mathematics since the 1960s, e.g. by Samuelson (1969), Merton (1969)
and Bertsekas (1976, Sec. 3.3). A good introduction to SDP and Markov control
can be found in Bertsekas (1976), Bertsekas and Shreve (1978), in Hernández-
Lerma and Lasserre (1996 and 1999) or in Bather (2000). Among recent appli-
cations to discrete-time finance we only mention Hernández-Lerma and Lasserre
(1996, Example 1.3.2) and Duffie (1988, Sec. III.19) – both contain more refer-
ences. It should be pointed out, however, that none of the classical models can
properly deal with the transaction cost problem we are dealing with. Before
giving details of the proof of Theorem 4.2.1, we should briefly comment on that.

4.2.1 Some comments on Markov control


In the terminology of Hernández-Lerma and Lasserre (1996 and 1999), a discrete-
time Markov-control process is a five-tuple (X, A, {A(x)|x ∈ X}, Q, c), with
state space X and a set A of control actions. {A(x)|x ∈ X} is a class of
nonempty sets A(x) ⊆ A, where A(x) contains the admissible control actions
in state x. xt denotes the state of the system at time t. Q is the transition
probability distribution Q(dy|x, a) = Pxt+1 |xt =x,at =a (dy), i.e. the distribution of
xt+1 given the system is in state xt = x at time t and control action at = a is
taken. Finally, c is the one step cost function, c(x, a) being the cost incurred
when choosing control action a in state x.
One then seeks a sequential choice of control actions a_i ∈ A(x_i) so as to minimize

    lim sup_{n→∞} (1/n) Σ_{i=0}^{n−1} E c(x_i, a_i)

or maximize

    lim inf_{n→∞} (1/n) Σ_{i=0}^{n−1} E (−c)(x_i, a_i)

(−c thus becomes the utility function). Optimal strategies can be generated from solutions (ρ∗, h) of the Bellman equation

    ρ∗ + h(x) = min_{a ∈ A(x)} { c(x, a) + ∫_X h(y) Q(dy|x, a) }

with ρ∗ ∈ IR and continuous function h. In order to solve this equation, either appropriate boundedness and continuity properties of the solutions of the discounted Bellman equation

    ρ + h(x) = min_{a ∈ A(x)} { c(x, a) + δ ∫_X h(y) Q(dy|x, a) }

(0 < δ < 1) are needed (Hernández-Lerma and Lasserre, 1996, Theorems 5.4.3
and 5.5.4) or recurrence and irreducibility conditions for the Markov chain with
the transition probability distribution Q (Ross, 1970, Cor. 6.20 or Bertsekas,
1976, Prop. 3). Typically, most research assumes the state space X to be count-
able and the additional existence of a state x∗ with Q(x∗|x, a) ≥ const. > 0 for
all x ∈ X, a ∈ A(x). Such conditions are hard to verify from market data and
– what is even worse – they are not satisfied for the control problem of portfolio
selection under transaction costs. Indeed, we have seen that the collection of
admissible actions (portfolio choices bi) at time i − 1 depends on the last d
observed return vectors as well as on the last chosen control bi−1 . Therefore,
we have no other choice than to describe the state of the system by the joint
vector (bi−1 , X̄i ). Then the transition dynamics under control bi is given by
(bi−1 , X̄i ) 7−→ (bi , X̄i+1 ). We thus end up with a transition probability distribu-
tion Q that clearly does not satisfy the mentioned recurrence and irreducibility
conditions. This drawback was also noted by Bielecki, Hernández and Pliska
(2000). They observe that the typical conditions imposed to ensure the existence of a solution of the Bellman equation are too rigid to be applicable in transaction cost problems. On the other hand, they found it still possible to char-
acterize optimal strategies in terms of optimality equations which correspond
to the classical Bellman equation. This is supported by the earlier findings of
Stettner (1999) and our results.

4.2.2 Proof of Theorem 4.2.1


In this section we prove the main result, Theorem 4.2.1. The proof requires
several steps. We first have to show the existence of a solution of the Bellman
equation and investigate certain properties of the solution. We then have to
show that the strategy b∗i calculated by (4.2.3) consists of portfolio choices that
are admissible. In a third step we will derive a technical tool to approximate
admissible strategies based on a portfolio selection function by simpler periodic

strategies (Lemma 4.2.6). It is only in the fourth step that, using the technical
tools derived before, we will completely prove Theorem 4.2.1.
1st step in the proof of Theorem 4.2.1: Solving the Bellman equation.
In dynamic programming, the Bellman equation usually takes the form

    λ + h(s, x̄) = max_{b ∈ S(s, x̄)} { m(b, x̄) + (1 − δ) E[h(b, X̄_{d+1}) | X̄_d = x̄] },    (4.2.5)

which is to be solved for h ∈ C(S × [A, B]dm ) and λ ∈ IR (see e.g. Bertsekas,
1976, Sec. 8.1 or Hernández-Lerma and Lasserre, 1996, Sec. 5.2). We first de-
rive some basic facts about the existence of solutions and their properties. The
following proposition is well known from the theory of dynamic programming; we nonetheless outline a proof.

Proposition 4.2.2. For all δ ∈ (0, 1) there exists a solution (h, λ) ∈ C(S ×
[A, B]dm ) × IR and a solution (h, 0) ∈ C(S × [A, B]dm ) × {0} of the Bellman
equation (4.2.5).

Proof. In the following let g ∈ C(S × [A, B]^{dm}), λ ∈ IR. The seminorm

    ‖g‖ := max_{(s,x̄) ∈ S×[A,B]^{dm}} g(s, x̄) − min_{(s,x̄) ∈ S×[A,B]^{dm}} g(s, x̄)

on C(S × [A, B]^{dm}) makes the factor space

    C∗ := C(S × [A, B]^{dm}) / {constant functions}

a Banach space with norm ‖[g]‖ := ‖g‖, where [g] := {g(·, ·) + r | r ∈ IR} denotes the equivalence class of g ∈ C(S × [A, B]^{dm}).
Indeed, K := {constant functions} is a closed subspace of the space (C(S × [A, B]^{dm}), ‖ · ‖_∞). Then (C(S × [A, B]^{dm}) / K, ‖ · ‖∗) with

    ‖ · ‖∗ : C(S × [A, B]^{dm}) / K → IR_0^+ : [f] ↦ ‖[f]‖∗ := inf_{k∈K} ‖f + k‖_∞ = inf_{c∈IR} ‖f + c‖_∞

is a Banach space (Hirzebruch and Scharlau, 1996, Lemma 5.10). The norms ‖ · ‖∗ and ‖ · ‖ are equivalent on C(S × [A, B]^{dm}) / K because of inf_{c∈IR} ‖f + c‖_∞ = (1/2)‖f‖.

Note that V3 implies that for any g ∈ C(S × [A, B]^{dm}), the conditional expectation E[g(b, X̄_{d+1}) | X̄_d = x̄] is continuous in (b, x̄) ∈ S × [A, B]^{dm}, and S(·, ·) is continuous in the sense of Aliprantis and Border (1999, Definition 16.2 and Theorems 16.20, 16.21). Now, by Berge’s Maximum Theorem (Aliprantis and Border, 1999, Theorem 16.31), max_{b∈S(s,x̄)} {m(b, x̄) + (1 − δ) E[g(b, X̄_{d+1}) | X̄_d = x̄]} is continuous on S × [A, B]^{dm}. Hence we can define the operator

    M : C(S × [A, B]^{dm}) → C(S × [A, B]^{dm}) :
        g(s, x̄) ↦ max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δ) E[g(b, X̄_{d+1}) | X̄_d = x̄] }.

On C∗, M operates according to M[g] := [M g]; observe that the right hand side is independent of the chosen representative of [g]. Solving (4.2.5) thus becomes equivalent to solving the functional equation

    M[h] = [h]    (4.2.6)

in C∗. This can be accomplished by an application of Banach’s Fixed Point Theorem (Aliprantis and Border, 1999, Theorem 3.36): (4.2.6) can be solved using the iteration [h]_{n+1} := M[h]_n (value iteration), provided M is a contraction mapping. This will be shown in the following, using standard techniques.
For functions g, h ∈ C(S × [A, B]^{dm}) we have

    (M h)(s, x̄) = max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δ) E[h(b, X̄_{d+1}) | X̄_d = x̄] }
                = m(b∗, x̄) + (1 − δ) E[h(b∗, X̄_{d+1}) | X̄_d = x̄],

    (M g)(s, x̄) = max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δ) E[g(b, X̄_{d+1}) | X̄_d = x̄] }
                ≥ m(b∗, x̄) + (1 − δ) E[g(b∗, X̄_{d+1}) | X̄_d = x̄]

for some b∗ ∈ S(s, x̄). Hence, writing max_{(s,x̄)} instead of max_{(s,x̄)∈S×[A,B]^{dm}},

    (M h)(s, x̄) − (M g)(s, x̄) ≤ (1 − δ) E[h(b∗, X̄_{d+1}) − g(b∗, X̄_{d+1}) | X̄_d = x̄]
                              ≤ (1 − δ) max_{(s,x̄)} (h(s, x̄) − g(s, x̄))

for all (s, x̄) ∈ S × [A, B]^{dm}, which yields

    max_{(s,x̄)} ((M h)(s, x̄) − (M g)(s, x̄)) ≤ (1 − δ) max_{(s,x̄)} (h(s, x̄) − g(s, x̄)).

From this we find that

    ‖M h − M g‖ = max_{(s,x̄)} ((M h)(s, x̄) − (M g)(s, x̄)) − min_{(s,x̄)} ((M h)(s, x̄) − (M g)(s, x̄))
                = max_{(s,x̄)} ((M h)(s, x̄) − (M g)(s, x̄)) + max_{(s,x̄)} ((M g)(s, x̄) − (M h)(s, x̄))
                ≤ (1 − δ) max_{(s,x̄)} (h(s, x̄) − g(s, x̄)) + (1 − δ) max_{(s,x̄)} (g(s, x̄) − h(s, x̄))
                = (1 − δ) ‖g − h‖

for 0 < 1 − δ < 1, a contraction property that implies the existence of a solution of (4.2.5).
Finally, let (h, λ) ∈ C(S × [A, B]^{dm}) × IR be an arbitrary solution of (4.2.5),

    λ + h(s, x̄) = max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δ) E[h(b, X̄_{d+1}) | X̄_d = x̄] }.

This is equivalent to (c ∈ IR arbitrary)

    (λ − δc) + (h + c)(s, x̄) = max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δ) E[(h + c)(b, X̄_{d+1}) | X̄_d = x̄] }.

In particular, choosing c := λ/δ, we obtain for h̃ := h + c the relation

    h̃(s, x̄) = (λ − δc) + h̃(s, x̄) = max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δ) E[h̃(b, X̄_{d+1}) | X̄_d = x̄] },

and (h̃, 0) also solves the Bellman equation. □
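The span-seminorm contraction just established can be observed numerically. The following toy finite-state analogue of the operator M iterates the value iteration [h]_{n+1} := M[h]_n; the states, actions, rewards and transition probabilities are invented for illustration and do not model the market of this chapter:

```python
def span(v):
    """The seminorm ||v|| = max(v) - min(v) used in the proof."""
    return max(v) - min(v)

def bellman_op(h, m, P, adm, delta):
    """Toy analogue of M:
    (M h)(s) = max_{b in adm[s]} ( m[b][s] + (1 - delta) * sum_t P[b][s][t] h[t] )."""
    n = len(h)
    return [max(m[b][s] + (1 - delta) * sum(P[b][s][t] * h[t] for t in range(n))
                for b in adm[s])
            for s in range(n)]

# Two states, two actions; all numbers are illustrative.
m_ = [[1.0, 0.2], [0.4, 0.9]]                       # m[action][state]
P = [[[0.7, 0.3], [0.5, 0.5]], [[0.2, 0.8], [0.6, 0.4]]]
adm = [[0, 1], [0, 1]]                              # both actions admissible
delta = 0.1

h = [0.0, 0.0]
spans = []
for _ in range(30):
    h_new = bellman_op(h, m_, P, adm, delta)
    spans.append(span([a - b for a, b in zip(h_new, h)]))
    h = h_new
# Successive differences contract in the span seminorm at rate <= 1 - delta.
```

As the proof predicts, span(h_{n+2} − h_{n+1}) ≤ (1 − δ) span(h_{n+1} − h_n) in every iteration, so the equivalence classes [h]_n converge geometrically.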


The next lemma gives some technical properties of arbitrary solutions (h, 0) of
the Bellman equation.

Lemma 4.2.3. Let {δ_i}_i be a monotonically decreasing sequence with δ_i ∈ (0, 1) and (δ_i − δ_{i+1})/δ_{i+1}² ≤ 1. If (h_i, 0) is a solution of the Bellman equation (4.2.5, δ = δ_i), then

1. δ_i ‖h_i‖_∞ ≤ ‖f‖_∞.

2. ‖h_{i+1} − h_i‖_∞ ≤ ‖f‖_∞.

3. Along any admissible portfolio sequence {b_i}_i we have

    E h_i(b_{i−1}, X̄_i) − E h_i(b_i, X̄_{i+1}) ≥ −2‖f‖_∞.    (4.2.7)

In particular, for all j,

    E h_i(b_{i−1}, X̄_i) − E h_i(a∗, X̄_j) = E h_i(b_{i−1}, X̄_i) − E h_i(a∗, X̄_{i+1}) ≥ −2‖f‖_∞.    (4.2.8)

Proof. 1. The Bellman equation (4.2.5) with λ = 0 implies that ‖f‖_∞ + (1 − δ_i)‖h_i‖_∞ ≥ ‖h_i‖_∞, which can be rewritten as δ_i ‖h_i‖_∞ ≤ ‖f‖_∞.
2. Similarly as in the proof of Proposition 4.2.2,

    h_i(s, x̄) = max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δ_i) E[h_i(b, X̄_{d+1}) | X̄_d = x̄] }
              = m(b∗, x̄) + (1 − δ_i) E[h_i(b∗, X̄_{d+1}) | X̄_d = x̄],

    h_j(s, x̄) = max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δ_j) E[h_j(b, X̄_{d+1}) | X̄_d = x̄] }
              ≥ m(b∗, x̄) + (1 − δ_j) E[h_j(b∗, X̄_{d+1}) | X̄_d = x̄]

for some b∗ ∈ S(s, x̄). Taking differences yields

    h_i(s, x̄) − h_j(s, x̄)
        ≤ (1 − δ_i) E[h_i(b∗, X̄_{d+1}) | X̄_d = x̄] − (1 − δ_j) E[h_j(b∗, X̄_{d+1}) | X̄_d = x̄]
        ≤ max_{(s,x̄)∈S×[A,B]^{dm}} {(1 − δ_i) h_i(s, x̄) − (1 − δ_j) h_j(s, x̄)}
        ≤ (1 − δ_i) ‖h_i − h_j‖_∞ + |δ_j − δ_i| ‖h_j‖_∞
        ≤ max{1 − δ_i, 1 − δ_j} ‖h_i − h_j‖_∞ + |δ_j − δ_i| max{‖h_i‖_∞, ‖h_j‖_∞}.

The right hand side of this chain of inequalities remains the same when swapping i and j, therefore

    ‖h_i − h_j‖_∞ ≤ max{1 − δ_i, 1 − δ_j} ‖h_i − h_j‖_∞ + |δ_j − δ_i| max{‖h_i‖_∞, ‖h_j‖_∞}.

Using part 1 of the lemma, we conclude that

    ‖h_i − h_j‖_∞ ≤ |δ_j − δ_i| · max{‖h_i‖_∞, ‖h_j‖_∞} / (1 − max{1 − δ_i, 1 − δ_j}) ≤ (|δ_j − δ_i| / min{δ_i, δ_j}²) ‖f‖_∞.

Monotonicity of δ_i and the assumption (δ_i − δ_{i+1})/δ_{i+1}² ≤ 1 allow us to infer

    ‖h_i − h_{i+1}‖_∞ ≤ ((δ_i − δ_{i+1}) / δ_{i+1}²) ‖f‖_∞ ≤ ‖f‖_∞.

3. The relation (4.2.8) is a direct consequence of (4.2.7) because of the stationarity of {X_i}_i and a∗ ∈ S(b_{i−1}, X̄_i) being deterministic. To prove (4.2.7) observe that the Bellman equation implies

    h_i(b_{i−1}, X̄_i) ≥ m(b_i, X̄_i) + (1 − δ_i) E[h_i(b_i, X̄_{i+1}) | X̄_i],

and

    E h_i(b_{i−1}, X̄_i) − E h_i(b_i, X̄_{i+1}) ≥ E m(b_i, X̄_i) − δ_i E h_i(b_i, X̄_{i+1}) ≥ −‖f‖_∞ − δ_i ‖h_i‖_∞.

Plugging in the result from part 1 of the lemma yields

    E h_i(b_{i−1}, X̄_i) − E h_i(b_i, X̄_{i+1}) ≥ −2‖f‖_∞,

and the proof is finished. □

2nd step in the proof of Theorem 4.2.1: Admissibility of {b∗_i}_i.
We will show that the maximization problem (4.2.3) is solved by a measurable solution procedure b∗_i = φ_i(b∗_{i−1}, X̄_i) with suitable portfolio selection functions φ_i. Thus {b∗_i}_i becomes an admissible strategy. The argument involves some notions from set-valued analysis.
The admissible set S(s, x̄) for (s, x̄) ∈ S × [A, B]^{dm} is a member of the family C = C(S) of closed subsets of S. C is equipped with the σ-algebra generated by families of the form F_K := {A ∈ C(S) | A ∩ K ≠ ∅} (K ranging over all compact K ⊆ S) (Matheron, 1975, p. 27, or Molchanov, 1993, Chapter 1). At time i − 1, an admissible strategy {b_i}_i picks some element of the random set S_i := S(b_{i−1}, X̄_i) = {b ∈ S | g_c(b_{i−1}, X̄_i, b) = 0} ∈ C. S_i is a so-called random closed set (RACS), a measurable mapping C : Ω → C. b_i itself is a selector, a random variable Ω → S such that b_i ∈ S_i with probability 1. A short introduction to RACS and selectors can be found in Hernández-Lerma and Lasserre (1996, Appendix D) and Bertsekas and Shreve (1978, Sec. 7.5).
S_i is compact. Thus the solution of the maximization problem (4.2.3) is the random non-empty set

    arg max_{b ∈ S(b_{i−1}, X̄_i)} V_i(b, X̄_i) := { b ∈ S(b_{i−1}, X̄_i) | V_i(b, X̄_i) = sup_{c ∈ S(b_{i−1}, X̄_i)} V_i(c, X̄_i) }.

We need only show that this is a RACS for which a suitable selector exists:

Lemma 4.2.4. The mapping

    Ω → C : ω ↦ arg max_{b ∈ S(b_{i−1}(ω), X̄_i(ω))} V_i(b, X̄_i(ω))

is a RACS for which a selector of the form φ_i(b_{i−1}, X̄_i) exists, with a measurable function

    φ_i : S × [A, B]^{dm} → S.

From Lemma 4.2.4 it follows that the recurrence relation (4.2.3) is solved by

    b∗_0 := a∗,   b∗_i := φ_i(b∗_{i−1}, X̄_i).

In particular, b∗_i is a selector for arg max_{b ∈ S(b_{i−1}, X̄_i)} V_i(b, X̄_i), hence F_i-measurable, so that {b∗_i}_i constitutes an admissible strategy.
Proof of Lemma 4.2.4. The proof will be given in three steps. Fix some
i ∈ IN.
First note that S(·, ·) : S × [A, B]^{dm} → C : (s, x̄) ↦ S(s, x̄) is a measurable mapping, so that S(b_i, X̄_{i+1}) is a RACS: the continuity of g_c implies that {b ∈ S | g_c(s, x̄, b) = 0} is a closed subset of S. For the measurability of S(·, ·) we need only show that for any compact K ⊆ S

    S^{−1}(F_K) ∈ B(S × [A, B]^{dm}),

with the σ-algebra B(S × [A, B]^{dm}) of Borel sets on S × [A, B]^{dm}. To this end, choose a countable dense subset K′ of K. Using the continuity of g_c again, it is easily verified that

    S^{−1}(F_K) = ∩_{n∈IN} ∪_{k∈K′} { (s, x̄) | |g_c(s, x̄, k)| < 1/n },

which implies the measurability of S^{−1}(F_K).


Secondly, we will see that α : C × [A, B]dm → C : (C, x̄) 7→ arg maxa∈C Vi(a, x̄)
is measurable. This combined with S(bi , X̄i+1 ) being a RACS yields that

arg max Vi (b, X̄i)


b∈S(bi−1 ,X̄i )
118 Chapter 4. A Markov model with transaction costs: probabilistic view

is itself a RACS. As to the measurability of α we consider C ∈ C and a compact


subset K of S. With

\ [  1

−1

α (FK ) = (C, x̄) sup Vi (y, x̄) ≤ Vi (k, x̄) +
,
0 y∈C n
n∈IN k∈K

it suffices to verify that each of the sets

    { (C, x̄) | sup_{y∈C} V_i(y, x̄) ≤ c }   (c ∈ IR)

is measurable. Indeed, if S′ is a countable dense subset of S, then

    { (C, x̄) | sup_{y∈C} V_i(y, x̄) > c } = { (C, x̄) | ∃ y ∈ C ∩ S′ : V_i(y, x̄) > c }
        = ∪_{y∈S′} { (C, x̄) | y ∈ C, V_i(y, x̄) > c } = ∪_{y∈S′} ( {C | y ∈ C} × {x̄ | V_i(y, x̄) > c} )
        = ∪_{y∈S′} ( F_{{y}} × V_i(y, ·)^{−1}((c, ∞)) ),

and the desired measurability follows.


Thirdly, we apply Theorem 7.33 in Bertsekas and Shreve (1978) (or Alipran-
tis and Border, 1999, Theorem 17.18). From there, for the closed set D :=
{(s, x̄, b)|b ∈ S(s, x̄)}, we can find a measurable function φi : {(s, x̄)|∃b :
(s, x̄, b) ∈ D} → S with

Vi(φi (s, x̄), x̄) = max Vi(b, x̄).


b∈{b|(s,x̄,b)∈D}

Now, the proof is finished observing that {b|(s, x̄, b) ∈ D} = S(s, x̄), {(s, x̄) | ∃b :
(s, x̄, b) ∈ D} = S × [A, B]dm (because of a∗ ∈ S(s, x̄)). Thus

φi (bi−1 , X̄i ) ∈ arg max Vi (b, X̄i)


b∈S(bi−1 ,X̄i )

with probability 1. 2

3rd step in the proof of Theorem 4.2.1: Approximation of strategies that are based on portfolio selection functions.
For the analysis of the strategy b∗_i it will be convenient to approximate strategies based on portfolio selection functions by members of a smaller class of strategies which we call “periodic” strategies:

Definition 4.2.5. An admissible strategy {b_i}_i is called N-periodic (N ∈ IN) if for all k ∈ IN_0

    b_{kN} = a∗   P-a.s.

For any admissible strategy {b_i}_i and any ε > 0 we define

    N({b_i}_i, ε) := min{ n : | (1/n) Σ_{i=1}^n E f(b_i, X̄_{i+1}) − lim sup_{n→∞} (1/n) Σ_{i=1}^n E f(b_i, X̄_{i+1}) | ≤ ε }.

N({b_i}_i, ε) measures how long it takes the strategy {b_i}_i to approach the long-term optimum for the first time up to an error of ε. Note that there is some N ∈ IN such that

    | (1/N) Σ_{i=1}^N E f(b_i, X̄_{i+1}) − lim sup_{n→∞} (1/n) Σ_{i=1}^n E f(b_i, X̄_{i+1}) | ≤ ε,

so that N({b_i}_i, ε) < ∞. Using N({b_i}_i, ε) one can approximate any strategy based on a portfolio selection function arbitrarily closely by a periodic strategy in the following sense:

Lemma 4.2.6. Let {b_i}_i be an admissible strategy based on a portfolio selection function c. Then for any ε > 0 there exists an admissible N-periodic strategy {b̃_i}_i (N ∈ IN) with

    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} E f(b̃_i, X̄_{i+1}) − (1/n) Σ_{i=0}^{n−1} E f(b_i, X̄_{i+1}) ) ≥ −ε.

Proof. Let N := N({b_i}_i, ε), μ_n := (1/n) Σ_{i=0}^{n−1} E f(b_i, X̄_{i+1}) and s := lim sup_{n→∞} μ_n, i.e.

    |μ_N − s| ≤ ε.    (4.2.9)
Considering the fact that, at any stage, the investor may choose the portfolio a∗, we define an N-periodic admissible strategy b̃_i by

    b̃_0 := a∗,  b̃_1 := c(a∗, X̄_1),  ...,  b̃_{N−1} := c(b̃_{N−2}, X̄_{N−1}),
    b̃_N := a∗,  b̃_{N+1} := c(a∗, X̄_{N+1}),  ...,  b̃_{2N−1} := c(b̃_{2N−2}, X̄_{2N−1}),
    b̃_{2N} := a∗,  etc.

In particular, b̃_0 = b_0, ..., b̃_{N−1} = b_{N−1}. Hence, for all k ∈ IN_0 (with the convention (1/0) Σ_{i=0}^{−1} ... = 0):

    μ̃_{kN} := (1/(kN)) Σ_{i=0}^{kN−1} E f(b̃_i, X̄_{i+1}) = (1/k) Σ_{j=0}^{k−1} (1/N) Σ_{i=jN}^{(j+1)N−1} E f(b̃_i, X̄_{i+1}) = (1/k) Σ_{j=0}^{k−1} μ_N = μ_N.    (4.2.10)

This follows from the construction of b̃_i with the selection function c. Indeed, the matrix (b̃_{jN}, ..., b̃_{(j+1)N−1}, X̄_{jN+1}, ..., X̄_{(j+1)N}) is a function of (X̄_{jN+1}, ..., X̄_{(j+1)N}) and as such is distributed as (b_0, ..., b_{N−1}, X̄_1, ..., X̄_N), which is used to calculate μ_N (by stationarity).
(4.2.9) and (4.2.10) imply that for all k ∈ IN0: |µ̃kN − s| ≤ ε. Now, let ⌊n⌋N be the largest multiple kN of N (k ∈ IN) with kN ≤ n. Then

    |µ̃n − s| = | (⌊n⌋N/n)(µ̃⌊n⌋N − s) + (1/n) Σ_{i=⌊n⌋N}^{n−1} Ef(b̃i, X̄i+1) + (⌊n⌋N/n − 1)s |
             ≤ (⌊n⌋N/n)|µ̃⌊n⌋N − s| + ((N−1)/n)‖f‖∞ + ((n − ⌊n⌋N)/n)|s|
             ≤ 1·ε + ((N−1)/n)‖f‖∞ + ((N−1)/n)|s|,

where we used n − ⌊n⌋N ≤ N − 1. It follows that lim sup_{n→∞} |µ̃n − s| ≤ ε and finally that

    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Ef(b̃i, X̄i+1) − (1/n) Σ_{i=0}^{n−1} Ef(bi, X̄i+1) )
    ≥ lim inf_{n→∞} µ̃n − lim sup_{n→∞} µn = lim inf_{n→∞} µ̃n − s
    ≥ lim inf_{n→∞} ( −|µ̃n − s| ) = −lim sup_{n→∞} |µ̃n − s| ≥ −ε,

proving Lemma 4.2.6. □


In the proof of Theorem 4.2.1, Lemma 4.2.6 will enable us to restrict ourselves to the class of periodic strategies competing with {b∗i}i; these will turn out to be much more tractable. We are now in a position to turn to the
4th step in the proof of Theorem 4.2.1: Finishing the proof of Theo-
rem 4.2.1.
Consider a given admissible strategy {bi}i based on a portfolio selection function. Let ε > 0 be arbitrary but fixed. According to Lemma 4.2.6 there exists an N := N({bi}i, ε)-periodic admissible strategy {b̃i}i with

    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Ef(b̃i, X̄i+1) − (1/n) Σ_{i=0}^{n−1} Ef(bi, X̄i+1) ) ≥ −ε.

We will show that

    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Ef(b∗i, X̄i+1) − (1/n) Σ_{i=0}^{n−1} Ef(b̃i, X̄i+1) ) ≥ 0.        (4.2.11)

This yields

    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Ef(b∗i, X̄i+1) − (1/n) Σ_{i=0}^{n−1} Ef(bi, X̄i+1) )
    ≥ lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Ef(b∗i, X̄i+1) − (1/n) Σ_{i=0}^{n−1} Ef(b̃i, X̄i+1) )
      + lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Ef(b̃i, X̄i+1) − (1/n) Σ_{i=0}^{n−1} Ef(bi, X̄i+1) )
    ≥ −ε,

and the assertion follows from ε being arbitrary.


It remains to show (4.2.11). To this end, observe that by V2

    Ef(bi, X̄i+1) = E( E[f(bi, X̄i+1)|Fi] ) = E( E[f(b, X̄i+1)|Fi] |b=bi )
                 = E( E[f(b, X̄i+1)|X̄i] |b=bi ) = Em(bi, X̄i).

Replacing f by m does not alter the value of the target function E( (1/n) Σ_{i=0}^{n−1} f(bi, X̄i+1) ), hence we may prove (4.2.11) in the form

    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Em(b∗i, X̄i) − (1/n) Σ_{i=0}^{n−1} Em(b̃i, X̄i) ) ≥ 0.

First note that the Bellman equation implies that for all admissible strategies {bi}i

    (1 − δi+1) E[hi+1(bi+1, X̄i+2)|Fi+1]
    = m(bi+1, X̄i+1) + (1 − δi+1) E[hi+1(bi+1, X̄i+2)|Fi+1] − m(bi+1, X̄i+1)
    ≤ max_{b ∈ S(bi, X̄i+1)} { m(b, X̄i+1) + (1 − δi+1) E[hi+1(b, X̄i+2)|Fi+1] } − m(bi+1, X̄i+1)
    = hi+1(bi, X̄i+1) − m(bi+1, X̄i+1).

Taking expectations,

    (1 − δi+1) Ehi+1(bi+1, X̄i+2) = (1 − δi+1) E( E[hi+1(bi+1, X̄i+2)|Fi+1] )
                                  ≤ Ehi+1(bi, X̄i+1) − Em(bi+1, X̄i+1),

and summing up we are left with

    E( (1/n) Σ_{i=−1}^{n−2} m(bi+1, X̄i+1) )
    ≤ −E( (1/n) Σ_{i=−1}^{n−2} [ (1 − δi+1) hi+1(bi+1, X̄i+2) − hi+1(bi, X̄i+1) ] ).

Equality holds for the strategy {b∗i}i, i.e.


    E( (1/n) Σ_{i=0}^{n−1} m(b∗i, X̄i) ) − E( (1/n) Σ_{i=0}^{n−1} m(b̃i, X̄i) )
    ≥ E( (1/n) Σ_{i=−1}^{n−2} [ hi+1(b̃i+1, X̄i+2) − hi+1(b̃i, X̄i+1) ] )
      − E( (1/n) Σ_{i=−1}^{n−2} [ hi+1(b∗i+1, X̄i+2) − hi+1(b∗i, X̄i+1) ] )
      + E( (1/n) Σ_{i=0}^{n−1} δi [ hi(b∗i, X̄i+1) − hi(b̃i, X̄i+1) ] )
    =: A − B + C.        (4.2.12)

We now investigate the asymptotic behaviour of the terms on the right hand side.
The first and the second term A and B of (4.2.12) are of the same form and tend to 0 as n → ∞ since

    lim_{n→∞} E( (1/n) Σ_{i=−1}^{n−2} [ hi+1(bi+1, X̄i+2) − hi+1(bi, X̄i+1) ] ) = 0

for any admissible strategy {bi}i. To prove this, we set h−1 := 0 and consider the decomposition

    (1/n) Σ_{i=−1}^{n−2} [ hi+1(bi+1, X̄i+2) − hi+1(bi, X̄i+1) ] = D + E
into

    D := (1/n) Σ_{i=−1}^{n−2} [ hi+1(bi+1, X̄i+2) − hi(bi, X̄i+1) ]

and

    E := (1/n) Σ_{i=−1}^{n−2} [ hi(bi, X̄i+1) − hi+1(bi, X̄i+1) ].

D is a telescopic sum, and using part 1 of Lemma 4.2.3 and the assumptions about {δi}i, we find that

    |D| = |hn−1(bn−1, X̄n) − h−1(b−1, X̄0)| / n ≤ ‖hn−1‖∞ / n ≤ ‖f‖∞ / (δn−1 n) → 0.

As to E we note that δ1 = δ2, δ3 = δ4 = δ5, δ6 = ... = δ9, etc. implies that h1 = h2, h3 = h4 = h5, h6 = ... = h9, ... . Since the blocks on which the δi are constant have increasing lengths 2, 3, 4, ..., there are at most 2√n + 1 non-zero differences in E. By virtue of Lemma 4.2.3 these are bounded in absolute value by max_i ‖hi+1 − hi‖∞ ≤ ‖f‖∞, which yields

    |E| ≤ ((2√n + 1)/n) ‖f‖∞ −→ 0.

The third term C of (4.2.12) is decomposed into

    E( (1/n) Σ_{i=0}^{n−1} δi [ hi(b∗i, X̄i+1) − hi(b̃i, X̄i+1) ] )
    = E( (1/n) Σ_{i=0}^{n−1} δi [ hi(b∗i, X̄i+1) − hi+1(b∗i, X̄i+1) ] )
      + E( (1/n) Σ_{i=0}^{n−1} δi [ hi+1(b∗i, X̄i+1) − hi+1(a∗, X̄⌊i⌋N+1) ] )
      + E( (1/n) Σ_{i=0}^{n−1} δi [ hi+1(a∗, X̄⌊i⌋N+1) − h⌊i⌋N+1(a∗, X̄⌊i⌋N+1) ] )
      + E( (1/n) Σ_{i=0}^{n−1} δi [ h⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − hi(b̃i, X̄i+1) ] ).

The absolute value of the first expectation in the decomposition is bounded from above by

    (1/n) Σ_{i=0}^{n−1} δi ‖f‖∞ −→ 0    (n → ∞)

(Lemma 4.2.3, part 2). Using Lemma 4.2.3 and (4.2.8), we can bound the second expectation from below by

    −2‖f‖∞ (1/n) Σ_{i=0}^{n−1} δi −→ 0    (n → ∞).

The third expectation has the lower bound

    −‖f‖∞ (1/n) Σ_{i=0}^{n−1} δi (i − ⌊i⌋N) ≥ −‖f‖∞ (N − 1) (1/n) Σ_{i=0}^{n−1} δi −→ 0    (n → ∞)

(Lemma 4.2.3, part 2). Therefore, we need only show that the fourth expectation satisfies

    lim inf_{n→∞} (1/n) Σ_{i=0}^{n−1} δi ( Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1) ) ≥ 0.        (4.2.13)

This will be done by exploiting the periodicity of {b̃i}i. In order to prove (4.2.13) we first assume that n = kN with k ∈ IN0:

    (1/(kN)) Σ_{i=0}^{kN−1} δi ( Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1) )
    = (1/k) Σ_{j=0}^{k−1} (1/N) Σ_{i=jN}^{(j+1)N−1} δi ( Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1) ).

Here, for i ∈ {jN, ..., (j+1)N − 1},

    Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1)
    = EhjN+1(b̃jN, X̄jN+1) − EhjN+1(b̃jN+1, X̄jN+2)
      + EhjN+1(b̃jN+1, X̄jN+2) − EhjN+2(b̃jN+1, X̄jN+2)
      + EhjN+2(b̃jN+1, X̄jN+2) − EhjN+2(b̃jN+2, X̄jN+3)
      + EhjN+2(b̃jN+2, X̄jN+3) − EhjN+3(b̃jN+2, X̄jN+3)
      + EhjN+3(b̃jN+2, X̄jN+3) − EhjN+3(b̃jN+3, X̄jN+4)
      + ... − ...
      + Ehi−1(b̃i−1, X̄i) − Ehi(b̃i−1, X̄i)
      + Ehi(b̃i−1, X̄i) − Ehi(b̃i, X̄i+1)
    ≥ −2‖f‖∞ (i − jN) − ‖f‖∞ (i − jN − 1),        (4.2.14)
where the 1st, 3rd, etc. differences after the equality (the non-indented pairs) can be bounded by (4.2.7), and the 2nd, 4th, etc. differences (the indented pairs) by Lemma 4.2.3, part 2. Consequently,

    Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1) ≥ −3‖f‖∞ (i − jN)

and

    (1/(kN)) Σ_{i=0}^{kN−1} δi ( Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1) )
    ≥ −(3‖f‖∞/k) Σ_{j=0}^{k−1} (1/N) Σ_{i=jN}^{(j+1)N−1} δi (i − jN)
    ≥ −(3‖f‖∞/k) Σ_{j=0}^{k−1} (δjN/N) Σ_{i=jN}^{(j+1)N−1} (i − jN)
    = −3‖f‖∞ ( (1/k) Σ_{j=0}^{k−1} δjN ) ( (1/N) Σ_{i=0}^{N−1} i )
    = −3‖f‖∞ ( (1/k) Σ_{j=0}^{k−1} δjN ) (N − 1)/2.        (4.2.15)

For arbitrary n ∈ IN (not necessarily n = kN), we find that

    (1/n) Σ_{i=0}^{n−1} δi ( Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1) )
    = (⌊n⌋N/n) · { (1/⌊n⌋N) Σ_{i=0}^{⌊n⌋N−1} δi ( Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1) ) }
      + { (1/n) Σ_{i=⌊n⌋N}^{n−1} δi ( Eh⌊i⌋N+1(a∗, X̄⌊i⌋N+1) − Ehi(b̃i, X̄i+1) ) }.        (4.2.16)

Set k := ⌊n/N⌋ and bound the first bracket in (4.2.16) from below by

    −(⌊n⌋N/n) · (3‖f‖∞ (N − 1)/2) · ( (1/k) Σ_{j=0}^{k−1} δjN ) −→ 0    (n → ∞)

(use (4.2.15) and note that n → ∞ implies k → ∞). The absolute value of the second bracket is bounded from above by

    (1/n) Σ_{i=⌊n⌋N}^{n−1} δi ( ‖h⌊i⌋N+1‖∞ + ‖hi‖∞ )
    ≤ (1/n) Σ_{i=⌊n⌋N}^{n−1} ( δ⌊i⌋N ‖h⌊i⌋N‖∞ + δ⌊i⌋N ‖h⌊i⌋N+1 − h⌊i⌋N‖∞ + δi ‖hi‖∞ )
    ≤ (1/n) Σ_{i=⌊n⌋N}^{n−1} 3‖f‖∞ = 3‖f‖∞ (n − ⌊n⌋N)/n ≤ 3‖f‖∞ (N − 1)/n −→ 0    (n → ∞),

which concludes the proof. □

4.3 Further properties of the value function


As a preparation for the next chapter, where we will be dealing with the case
when the distribution of the return process is unknown, we prove the following
result concerning the Lipschitz continuity of the value function hi :

Proposition 4.3.1. Let (hi, 0) be a solution of the Bellman equation (4.2.5) for δ = δi (i = 1, 2, ...). Then for a sufficiently small commission factor c ≥ 0 there exists a constant K > 0 such that

    |δi · hi(b1, x̄) − δi · hi(b2, x̄)| ≤ K · ‖b1 − b2‖∞

for all b1, b2 ∈ S, all x̄ ∈ [A, B]dm and all i ∈ IN.

Proof. The argument requires some less known notions from analysis, especially Clarke's generalized derivative for Lipschitz continuous functions (Clarke, 1981). Let W ⊆ IR^{d′} and Z ⊆ IR^{d″} be Banach spaces (whose supremum norms are both denoted by ‖·‖∞). Given a Lipschitz continuous mapping Φ : W × Z → IR, Clarke's generalized derivative is defined as the convex hull of a limit set,

    ∂wΦ(w, z) := conv{ lim_{i→∞} ∇wΦ(wi, z) : wi → w },

where only those sequences wi are considered for which all gradients ∇wΦ(wi, z) and the limit lim_{i→∞} ∇wΦ(wi, z) exist (note that the gradients exist almost everywhere due to Rademacher's Theorem). H(M1, M2) denotes the Hausdorff distance between two subsets M1 and M2 of IR^{d′}, defined by

    H(M1, M2) := max{ sup_{w2∈M2} ρ(w2, M1), sup_{w1∈M1} ρ(w1, M2) }

with ρ(w2, M1) := inf_{w1∈M1} ‖w1 − w2‖∞.
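For finite point sets the Hausdorff distance can be computed directly from this definition; a small sketch, with finite sets standing in for the subsets M1, M2:

```python
def rho(w, M):
    """rho(w, M) = inf over m in M of ||w - m||_inf, M a finite point set."""
    return min(max(abs(wi - mi) for wi, mi in zip(w, m)) for m in M)

def hausdorff(M1, M2):
    """Hausdorff distance H(M1, M2) under the sup-norm, following the
    definition above, for finite point sets."""
    return max(max(rho(w2, M1) for w2 in M2),
               max(rho(w1, M2) for w1 in M1))

M1 = [(0.0, 0.0), (1.0, 0.0)]
M2 = [(0.0, 0.5)]
assert hausdorff(M1, M2) == hausdorff(M2, M1) == 1.0  # (1,0) is sup-distance 1 from M2
```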



We will also need the following proposition by Ledyaev (1984, Theorem 1) concerning Lipschitz continuity of implicitly defined set-valued functions:

Proposition 4.3.2. (Ledyaev, 1984, Theorem 1) Let W ⊆ IR^{d′} and Z ⊆ IR^{d″} be Banach spaces and Φ : W × Z → IR a mapping such that:

1. For all z ∈ Z the function Φ(·, z) is Lipschitz continuous.

2. There exists a constant L > 0 with

       |Φ(w, z1) − Φ(w, z2)| ≤ L‖z1 − z2‖∞

   for all w ∈ W, z1, z2 ∈ Z.

3. There exists a constant ∆ such that for all (w, z) ∈ W × Z with Φ(w, z) > 0

       inf_{v∈∂wΦ(w,z)} ‖v‖∞ > ∆.

Then the set-valued mapping M(z) := {w ∈ W | Φ(w, z) ≤ 0} satisfies the Lipschitz property

    H(M(z1), M(z2)) ≤ (L/∆) ‖z1 − z2‖∞.

Finally, we need the modulus of continuity, defined for any continuous function g : S × [A, B]dm → IR as (x̄ ∈ [A, B]dm fixed)

    ω( g(·, x̄), ε ) := sup_{s,t∈S, ‖s−t‖∞≤ε} |g(s, x̄) − g(t, x̄)|.

Combining the Hausdorff distance and the modulus of continuity it is easily seen that

    | max_{b∈S(s,x̄)} g(b, x̄) − max_{b∈S(t,x̄)} g(b, x̄) | ≤ ω( g(·, x̄), H(S(s, x̄), S(t, x̄)) ).        (4.3.1)

Thus, having all the tools we need at hand, we can embark on the proof of Proposition 4.3.1.
Let b = (b−1, ..., bm) ∈ S, x̄ = (x1, ..., xm) ∈ [A, B]dm. For fixed x̄ we set Φ : S × S → IR : (b, s) ↦ Φ(b, s) := |gc(s, x̄, b)|; recall gc from (4.1.3). Clearly, Φ is Lipschitz continuous in the first argument. Moreover,

    |Φ(b, s) − Φ(b, t)| ≤ |gc(s, x̄, b) − gc(t, x̄, b)| ≤ c · const(r, B, m) · ‖s − t‖∞

(to see this, note that the self-financing condition forces b−1 ≤ c). Taking the gradient for gc (under Φ(b, s) > 0) yields

    ∂bΦ(b, s) ⊆ ( {1} × {0}^m × [0, c] ) · ( (1 + r)b0 + Σ_{k=1}^{m} xk bk ),

so that

    inf_{v∈∂bΦ(b,s)} ‖v‖∞ ≥ (1 + r)b0 + Σ_{k=1}^{m} xk bk ≥ min{1 + r, A}(1 − b−1) ≥ const(r, A)(1 − c).

As a consequence, the conditions of Proposition 4.3.2 are fulfilled. We can choose ∆ := const(r, A) · (1 − c) and L := c · const(r, B, m) to obtain

    H(S(s, x̄), S(t, x̄)) ≤ (L/∆) ‖s − t‖∞.        (4.3.2)

For c sufficiently small, L/∆ ≤ 1. Recall the function

    Vi(b, x̄) = m(b, x̄) + (1 − δi) E[hi(b, X̄d+1)|X̄d = x̄]

that defined hi in (4.2.2). The Bellman equation yields

    |hi(s, x̄) − hi(t, x̄)| = | max_{b∈S(s,x̄)} Vi(b, x̄) − max_{b∈S(t,x̄)} Vi(b, x̄) | ≤ ω( Vi(·, x̄), ‖s − t‖∞ )

by (4.3.1) and (4.3.2). Hence,

    ω( hi(·, x̄), ε ) ≤ ω( Vi(·, x̄), ε ) ≤ ω( f(·, x̄), ε ) + (1 − δi) ω( hi(·, x̄), ε ).

The Lipschitz property of f yields ω( f(·, x̄), ε ) ≤ const. · ε, and from the latter chain of inequalities we obtain

    ω( hi(·, x̄), ε ) ≤ (const./δi) · ε.

Thus δi · hi(·, x̄) is Lipschitz continuous with the same Lipschitz constant as f (independent of x̄), and the proof is finished. □
CHAPTER 5

A Markov model with transaction costs: statistical view
Chapter 4 remained entirely within the probabilistic framework, i.e. the point of view of an investor with full knowledge of the underlying return distribution. This, of course, is highly unrealistic. In practice, the investor's view
is one of a statistician rather than a probabilist. From observations of stock
returns, factors and side information, he assembles a market “picture”, an idea
of the stochastic laws of the market. He then decides on an investment strat-
egy. In the following it will be shown how he can balance the need to avoid
transaction costs and the need to boost his wealth in his investment decisions
– without knowing the underlying return distribution. The model is the same
as in Chapter 4.
In particular, Section 5.1 sets up an empirical counterpart of the Bellman equa-
tion which we used in Theorem 4.2.1 to construct an optimal strategy. Based
on this empirical version of the Bellman equation we construct a portfolio se-
lection algorithm in Section 5.1.1 (cf. (5.1.5)). This algorithm has virtually
the same optimality properties as the algorithm in Chapter 4 (Theorems 5.1.1
and 5.1.2). To verify this, we need results on uniformly consistent regression
estimation which will be given in Section 5.2, Theorem 5.2.1 and Corollary
5.2.2 being the central results featuring the speed of uniform convergence of
kernel regression estimates. Section 5.3 finally gives the proof of the optimality
properties of the algorithm.

5.1 The empirical Bellman equation


The whole procedure described in the last chapter was founded on the Bellman
equation. In particular, in Theorem 4.2.1, we used the Bellman equation

    λi + hi(s, x̄) = max_{b∈S(s,x̄)} { m(b, x̄) + (1 − δi) E[hi(b, X̄d+1)|X̄d = x̄] },        (5.1.1)

λi ∈ IR, hi ∈ C(S × [A, B]dm), to construct an optimal investment strategy in a stationary d-stage Markov process {Xi}∞_{i=−d+1} of return vectors in a financial market. Here,

    m(b, x̄) := E[f(b, X̄i+1)|X̄i = x̄] = E[f(b, X̄d+1)|X̄d = x̄],

f being the (logarithmic) utility function and S(s, x̄) being the set of admissible
portfolio vectors when x̄ denotes the last d observed return vectors and s the
last chosen portfolio. The Bellman equation was solved by a value iteration
type of algorithm in Chapter 4, which crucially relies on the distribution of
the stationary process {X̄i }i. This distribution, in general, is unknown to the
investor. Nonetheless he may try to estimate a solution of the Bellman equation.
A natural way to obtain an estimated solution is to replace the conditional expectations E[hi(b, X̄d+1)|X̄d = x̄] and m(b, x̄) in (5.1.1) by the kernel estimates

    Σ_{j=−d+1}^{i−1} hi(b, X̄j+1) · K((x̄ − X̄j)/wi) / Σ_{k=−d+1}^{i−1} K((x̄ − X̄k)/wi)

and

    Σ_{j=−d+1}^{i−1} f(b, X̄j+1) · K((x̄ − X̄j)/wi) / Σ_{k=−d+1}^{i−1} K((x̄ − X̄k)/wi),

respectively. K is a bounded, Lipschitz continuous kernel function [A, B]dm → IR+0 with ∫_{IR^dm} K(x) dx = 1 and ∫_{IR^dm} K(x)‖x‖∞ dx < ∞. As we shall see, the bandwidths can be chosen as wi ∼ i^{−1/(dm+2)}. For simplicity, we use the shorthand notation

    Ki(X̄j, x̄) := K((x̄ − X̄j)/wi) / Σ_{k=−d+1}^{i−1} K((x̄ − X̄k)/wi)    with wi ∼ i^{−1/(dm+2)}.
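Such Nadaraya–Watson kernel weights are straightforward to compute. A minimal sketch, using a product of triangular kernels as one admissible choice of K (bounded, Lipschitz continuous, integral 1); the sample path is a purely illustrative stand-in:

```python
def kernel(u):
    """One admissible kernel on R^q: a product of triangular kernels
    max(1 - |t|, 0) -- bounded, Lipschitz continuous, integral 1."""
    out = 1.0
    for t in u:
        out *= max(1.0 - abs(t), 0.0)
    return out

def kernel_weights(history, x_bar, w):
    """Weights K_i(X_j, x_bar) = K((x_bar - X_j)/w) / sum_k K((x_bar - X_k)/w).
    `history` lists the observed vectors X_j; by construction the weights
    sum to 1 whenever some kernel value is positive."""
    vals = [kernel([(a - b) / w for a, b in zip(x_bar, xj)]) for xj in history]
    s = sum(vals)
    return [v / s for v in vals] if s > 0 else [1.0 / len(vals)] * len(vals)

dm = 2                                   # dimension of x_bar in this toy example
i = 100
w_i = i ** (-1.0 / (dm + 2))             # bandwidth w_i ~ i^(-1/(dm+2))
hist = [((0.1 * j) % 1.0, (0.07 * j) % 1.0) for j in range(i)]
wts = kernel_weights(hist, (0.5, 0.5), w_i)
assert abs(sum(wts) - 1.0) < 1e-12 and all(v >= 0.0 for v in wts)
```

The weights form a probability vector over the observed history, which is exactly the normalization property used below.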

We thus obtain what we call the empirical Bellman equation,

    λ̂i + ĥi(s, x̄) = max_{b∈S(s,x̄)} { Σ_{j=−d+1}^{i−1} [ f(b, X̄j+1) + (1 − δi) ĥi(b, X̄j+1) ] Ki(X̄j, x̄) },        (5.1.2)

which is to be solved for λ̂i ∈ IR, ĥi ∈ C(S × [A, B]dm).

5.1.1 An optimal strategy


If we set up an investment strategy as in Theorem 4.2.1, however, using the
empirical Bellman equation instead of the original Bellman equation (4.2.2),
two questions arise: Is there a solution to the empirical Bellman equation and
if so, is the corresponding strategy optimal?
We first tackle the existence of solutions of (5.1.2). Using the fact that Ki(X̄j, x̄) is a continuous function of x̄ and appealing to Berge's Maximum Theorem (Aliprantis and Border, 1999, Theorem 16.31) again, we can define an operator M̂i by

    M̂i : C(S × [A, B]dm) → C(S × [A, B]dm),
    (M̂i h)(s, x̄) := max_{b∈S(s,x̄)} { Σ_{j=−d+1}^{i−1} [ f(b, X̄j+1) + (1 − δi) h(b, X̄j+1) ] Ki(X̄j, x̄) }.

M̂i is the empirical counterpart of the operator M in the proof of Proposition 4.2.2. Because of

    Σ_{j=−d+1}^{i−1} Ki(X̄j, x̄) = 1,        (5.1.3)

M̂i operates on C(S × [A, B]dm)/{constant functions} according to M̂i[h] := [M̂i h]. Now, arguing similarly as in the proof of Proposition 4.2.2, we find that M̂i is a contraction mapping in the norm ‖g‖ := max g − min g. Indeed, for functions g, h ∈ C(S × [A, B]dm) there exists a b∗ ∈ S(s, x̄) with

    (M̂i h)(s, x̄) = Σ_{j=−d+1}^{i−1} [ f(b∗, X̄j+1) + (1 − δi) h(b∗, X̄j+1) ] Ki(X̄j, x̄),
    (M̂i g)(s, x̄) ≥ Σ_{j=−d+1}^{i−1} [ f(b∗, X̄j+1) + (1 − δi) g(b∗, X̄j+1) ] Ki(X̄j, x̄).
Using (5.1.3), this implies

    (M̂i h)(s, x̄) − (M̂i g)(s, x̄)
    ≤ (1 − δi) Σ_{j=−d+1}^{i−1} [ h(b∗, X̄j+1) − g(b∗, X̄j+1) ] Ki(X̄j, x̄) ≤ (1 − δi) ‖h − g‖∞.

Starting from here, one can argue exactly as in the proof of Proposition 4.2.2 to obtain ‖M̂i h − M̂i g‖ ≤ (1 − δi)‖h − g‖. Hence there exists a solution ĥi, λ̂i of the empirical Bellman equation. Due to (5.1.3) this can be normalized to

    ĥi(s, x̄) = max_{b∈S(s,x̄)} { Σ_{j=−d+1}^{i−1} [ f(b, X̄j+1) + (1 − δi) ĥi(b, X̄j+1) ] Ki(X̄j, x̄) }.        (5.1.4)

In the sequel, ĥi denotes a solution of (5.1.4).
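Since the update is a (1 − δi)-contraction, the normalized solution can be computed by plain fixed-point iteration. The following is a toy sketch on finite grids; the neighbourhood rule standing in for the admissible set S(s, x̄) and the numerical values are purely illustrative:

```python
def solve_empirical_bellman(f_vals, K, delta, n_iter=300):
    """Fixed-point iteration for a finite-grid analogue of the normalized
    empirical Bellman equation (5.1.4).  Grid index b stands in for a
    portfolio and index j for a sample point: f_vals[b][j] plays the role
    of f(b, X_{j+1}), h[b][j] that of h_hat(b, X_{j+1}), K[j][x] that of
    K_i(X_j, X_x) (each column sums to 1).  As an illustrative stand-in
    for S(s, x_bar), b is admissible from s iff |b - s| <= 1.  The update
    is a (1 - delta)-contraction in the sup-norm, so iteration converges."""
    nb, nx = len(f_vals), len(K)
    h = [[0.0] * nx for _ in range(nb)]
    for _ in range(n_iter):
        h = [[max(sum((f_vals[b][j] + (1.0 - delta) * h[b][j]) * K[j][x]
                      for j in range(nx))
                  for b in range(nb) if abs(b - s) <= 1)
              for x in range(nx)]
             for s in range(nb)]
    return h

# toy instance: three grid portfolios, two sample points
f_vals = [[0.02, -0.01], [0.00, 0.03], [0.01, 0.01]]
K = [[0.7, 0.4], [0.3, 0.6]]
h = solve_empirical_bellman(f_vals, K, 0.1)
# h satisfies the fixed-point equation: one more update changes nothing
update = lambda s, x: max(sum((f_vals[b][j] + 0.9 * h[b][j]) * K[j][x] for j in range(2))
                          for b in range(3) if abs(b - s) <= 1)
assert all(abs(h[s][x] - update(s, x)) < 1e-9 for s in range(3) for x in range(2))
```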

We are now in a position to define the strategy which will turn out to have the same optimality properties as the strategy in Chapter 4. In analogy to (4.2.3) the investor follows the strategy

    b̂0 := a∗   and   b̂i := arg max_{b∈S(b̂i−1, X̄i)} V̂i(b, X̄i)        (5.1.5)

with

    V̂i(b, x̄) := Σ_{j=−d+1}^{i−1} [ f(b, X̄j+1) + (1 − δi) ĥi(b, X̄j+1) ] Ki(X̄j, x̄).

Observe that, in contrast to b∗i from Chapter 4, this strategy can be constructed using observed data only; we need not know the underlying distribution of the return process.

The important feature is: This strategy is still optimal if the kernel estimates
work sufficiently well. A sufficient condition is given in the following theorem.

Theorem 5.1.1. Under the assumptions of Theorem 4.2.1, let hi be the solutions of the Bellman equation (5.1.1, λi = 0) and define the class G := {δi · hi, δi · f | i = 1, 2, ...}. Then

    lim_{i→∞} (1/δ_{i+1}²) sup_{g∈G} E sup_{b∈S, x̄∈[A,B]dm} | Σ_{j=−d+1}^{i} g(b, X̄j+1) Ki+1(X̄j, x̄) − E[g(b, X̄i+2)|X̄i+1 = x̄] | = 0        (5.1.6)

implies that

    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Ef(b̂i, X̄i+1) − (1/n) Σ_{i=0}^{n−1} Ef(bi, X̄i+1) ) ≥ 0        (5.1.7)

for any admissible strategy {bi}i based on a portfolio selection function.

This theorem translates the investment problem into a regression estimation problem. Of course, it is desirable to have practical sufficient conditions for the assumptions of Theorem 5.1.1 to hold. As we shall see, choosing δi ≥ 1/log i suffices; for example, we may choose δ1 = δ2 := d1, δ3 = δ4 = δ5 := d2, etc. with di ∼ 1/log i (cf. Theorem 5.1.2 below).
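One concrete way to generate such a block-constant sequence is sketched below; the specific value d_k := min(1, 1/log(k + 1)) for the k-th block is only one illustrative choice with d_k ∼ 1/log k:

```python
import math

def delta_sequence(n):
    """Block choice delta_1 = delta_2 := d_1, delta_3 = delta_4 = delta_5 := d_2,
    delta_6 = ... = delta_9 := d_3, ...: the k-th block has length k + 1 and
    carries the constant value d_k := min(1, 1/log(k + 1))."""
    deltas, k = [], 1
    while len(deltas) < n:
        deltas.extend([min(1.0, 1.0 / math.log(k + 1))] * (k + 1))
        k += 1
    return deltas[:n]

d = delta_sequence(20)
assert d[0] == d[1] and d[2] == d[3] == d[4] and d[5] == d[8]    # block structure
assert all(d[i] >= d[i + 1] for i in range(19))                  # nonincreasing
assert all(d[i] >= 1.0 / math.log(i + 1) for i in range(2, 20))  # delta_i >= 1/log i
```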
In the following, a stochastic process {Xi}∞_{i=−∞} is called a stationary GSM-process (geometrically strongly mixing) if, beyond stationarity, the following holds: there exist constants c > 0 and ρ ∈ [0, 1) such that the α-mixing coefficients

    α(k) := α( σ(Xi, i ≤ 0), σ(Xi, i ≥ k) ) := sup_{B∈σ(Xi, i≤0), C∈σ(Xi, i≥k)} |P(B ∩ C) − P(B)P(C)|

satisfy

    α(k) ≤ c · ρ^k    (k ≥ 1).
The behaviour of the α-mixing coefficients α(k) mirrors how fast dependency
in the process variables decays for large time lags k. Under mild assumptions,
the class of GSM-processes comprises linear processes (Bosq, 1996, Sec. 1.3,
2.3), polynomial AR-processes (Doukhan, 1994, Sec. 2.4.1, Th. 5) such as
ARMA-processes (Doukhan, 1994, Sec. 2.4.1.2, Th. 6, Cor. 3), and Doeblin-
or Harris-recurrent Markov chains (Doukhan, 1994, Sec. 2.4, Th. 1 and 3).
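For a stationary Markov chain the dependence between past and future reduces, by the Markov property, to the dependence between X0 and Xk, so for a two-state chain α(k) can be computed by brute force over the finitely many events. A sketch (the chain and its parameters are purely illustrative) showing the geometric decay α(k) ≤ c·ρ^k with ρ = |1 − a − b|:

```python
from itertools import product

def alpha_2state(a, b, k):
    """Strong-mixing coefficient between sigma(X_0) and sigma(X_k) for the
    stationary two-state Markov chain with transition matrix
    [[1-a, a], [b, 1-b]]; it decays geometrically like |1 - a - b|^k."""
    pi = (b / (a + b), a / (a + b))                      # stationary distribution
    P = [[1 - a, a], [b, 1 - b]]
    Pk = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):                                   # k-step transition matrix
        Pk = [[sum(Pk[i][l] * P[l][j] for l in range(2)) for j in range(2)]
              for i in range(2)]
    joint = [[pi[i] * Pk[i][j] for j in range(2)] for i in range(2)]
    events = [set(), {0}, {1}, {0, 1}]
    return max(abs(sum(joint[i][j] for i in B for j in C)
                   - sum(pi[i] for i in B) * sum(pi[j] for j in C))
               for B, C in product(events, events))

a1 = alpha_2state(0.3, 0.2, 1)
a3 = alpha_2state(0.3, 0.2, 3)
rho = abs(1 - 0.3 - 0.2)                                 # second eigenvalue, here 0.5
assert abs(a3 - a1 * rho ** 2) < 1e-12                   # geometric decay of alpha(k)
```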

Theorem 5.1.2. The assumptions of Theorem 5.1.1 are fulfilled if the following hold:

1. {Xi}∞_{i=−d+1} is a stationary [A, B]^m-valued d-stage Markov process (cf. V1–V3 in Section 4.1) and geometrically strongly mixing.

2. There exist densities fX̄0 and fX0|X̄0 of the distributions PX̄0 and PX0|X̄0, respectively, such that

   – fX̄0 is Lipschitz continuous, i.e., for some C > 0,

         |fX̄0(x̄) − fX̄0(ȳ)| ≤ C‖x̄ − ȳ‖∞    for all x̄, ȳ ∈ [A, B]dm,

   – the level sets {x̄ : fX̄0(x̄) ≥ 1/n} of fX̄0 satisfy

         H( supp fX̄0, {x̄ : fX̄0(x̄) ≥ 1/n} ) ≤ n^{−k}

     for some k > 0, H denoting the Hausdorff distance (cf. Section 4.3),

   – fX0|X̄0 is Lipschitz continuous such that for some C > 0,

         |fX0|X̄0(x, x̄) − fX0|X̄0(x, ȳ)| ≤ C‖x̄ − ȳ‖∞

     for all x ∈ [A, B]^m, x̄, ȳ ∈ [A, B]dm.

3. The commission factor c is sufficiently small.

4. δi ≥ 1/log i satisfies (4.2.1).

Hence, under the (not too restrictive) conditions of Theorem 5.1.2 we are able
to construct an admissible strategy {b̂i }i that is superior to any other admissible
strategy {bi}i based on a portfolio selection function in the sense of
    lim inf_{n→∞} ( (1/n) Σ_{i=0}^{n−1} Ef(b̂i, X̄i+1) − (1/n) Σ_{i=0}^{n−1} Ef(bi, X̄i+1) ) ≥ 0.

It should be stressed again that this is a conservative, i.e. worst case analysis,
the lim inf giving the worst possible performance of our strategy {b̂i}i .
Remark. There are a number of sufficient conditions for {Xi}i to be a stationary GSM-process (see, e.g., Doukhan, 1994). For example, the GSM property holds if there exists a continuous function r : [A, B]^m → IR+0 with

    fX0|X̄0(x0|x̄0) ≥ r(x0)

for all x̄0 ∈ [A, B]dm, x0 ∈ [A, B]^m and

    ∫_{[A,B]^m} r(x0) dx0 > 0.

To verify this, one may use the Doeblin condition in Theorem 1 of Doukhan (1994, Sec. 2.4): The transition probabilities of the d-stage Markov process {Xi}i are

    P(x̄0, C) = ∫_C fX0|X̄0(x0|x̄0) dx0

for x̄0 ∈ [A, B]dm, C ∈ B([A, B]^m). In particular, with the measure µ := r · λ (λ denoting the Lebesgue–Borel measure on [A, B]^m),

    P(x̄0, C) ≥ ∫_C r(x0) dx0 = µ(C)

and

    µ([A, B]^m) = ∫_{[A,B]^m} r(x0) dx0 ∈ (0, 1].

Under these circumstances, the cited theorem of Doukhan (1994) allows us to


conclude that {Xi }i is a GSM-process.

5.1.2 How to prove optimality

Before we actually prove the main results, Theorems 5.1.1 and 5.1.2, we sketch the way in which we are going to proceed. First, we need to establish some results on uniformly consistent regression estimation. This will be done in the next section. Once we have results on the speed of uniform almost sure convergence, we can embark on the core of the proof of the theorems in Section 5.3. There, we argue along the lines of the corresponding results in Chapter 4, using uniform consistency results to pass from the solution of the empirical Bellman equation to the solution of the original Bellman equation.

5.2 Uniformly consistent regression estimation


We start with the following curve estimation problem. Given data X0, X1, ..., Xn from a stationary IR^d-valued stochastic process {Xi}∞_{i=−∞} and a function g : S × IR^d → IR (S ⊆ IR^{d′} compact), the objective is to estimate both the function

    φ(g, b, x) := E[g(b, X1)|X0 = x] · fX0(x)

and the regression function

    R(g, b, x) := E[g(b, X1)|X0 = x].

The first will be estimated uniformly in b ∈ S and x ∈ IR^d by the kernel estimate

    Zn(g, b, x) := (1/(n wn^d)) Σ_{i=0}^{n−1} g(b, Xi+1) K((x − Xi)/wn),        (5.2.1)

the latter by

    Rn(g, b, x) := Zn(g, b, x) / Zn(1, b, x).

K : IR^d → IR+0 is assumed to be a fixed bounded, Lipschitz continuous kernel function with ∫_{IR^d} K(x) dx = 1 and ∫_{IR^d} K(x)‖x‖∞ dx < ∞. wn ∈ IR+ is a sequence of bandwidths to be chosen later (such that lim_{n→∞} wn = 0). 1 denotes the constant function (b, x) ↦ 1. Note that Zn(1, b, x) is an estimate of the density fX0(x).
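A minimal scalar (d = 1) implementation of (5.2.1) and the ratio estimate, with the triangular kernel as one admissible choice of K and a purely illustrative deterministic "sample":

```python
def Z_n(g, b, x, data, w):
    """Kernel estimate (5.2.1) of phi(g, b, x) for scalar data (d = 1),
    with the triangular kernel K(u) = max(1 - |u|, 0) as one admissible K."""
    n = len(data) - 1
    return sum(g(b, data[i + 1]) * max(1.0 - abs((x - data[i]) / w), 0.0)
               for i in range(n)) / (n * w)

def R_n(g, b, x, data, w):
    """Ratio estimate R_n = Z_n(g, b, x) / Z_n(1, b, x) of the regression
    function; Z_n(1, b, x) is the kernel density estimate of f_X0(x)."""
    return Z_n(g, b, x, data, w) / Z_n(lambda _b, _x: 1.0, b, x, data, w)

# sanity check: for constant g the ratio estimate reproduces the constant
data = [0.1 * i for i in range(21)]
w = (len(data) - 1) ** (-1.0 / 3.0)     # w_n = n^(-1/(d+2)) with d = 1
assert abs(R_n(lambda b, x: 3.0, None, 1.0, data, w) - 3.0) < 1e-12
```

Dividing by the density estimate makes Rn self-normalizing, which is why its rate in Corollary 5.2.2 deteriorates where fX0 gets small.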
Bosq (1996, Theorems 2.2, 3.3, 3.3) derives rates for the uniform almost sure convergence of Zn(1, b, x) and Zn(g, b, x) for a fixed g in stationary GSM-processes. The following theorem generalizes the results of Bosq so as to give the rate of convergence of the expected ‖·‖∞-error of the estimate (5.2.1) in GSM-processes uniformly over a huge class of functions g. For this, we agree on the following notation: F is the class of sequences {c n^s log^t n}_{n∈IN} (c > 0, s > 0, t ∈ IR, or c > 0, s = 0, t > 1). For any set X ⊆ IR^d we denote the ‖·‖∞-diameter of X by diam(X) := sup_{x,y∈X} ‖x − y‖∞. fX0 is a density of X0, fX1|X0 a density of the conditional distribution PX1|X0. For the estimation of φ, we will prove the

Theorem 5.2.1. Let {Xi}∞_{i=0} be a stationary GSM-process of IR^d-valued random variables with a Lipschitz continuous density fX0. Moreover, let G be a class of functions g : S × IR^d → IR (S ⊆ IR^{d′} compact) with the following property: There exists a constant C > 0 such that for all g ∈ G and all a, b ∈ S, x, y ∈ IR^d

    ‖g‖∞ ≤ C,        (5.2.2)
    ∫_{IR^d} |g(a, x)| dx ≤ C,        (5.2.3)
    |g(a, x) − g(b, x)| ≤ C · ‖a − b‖∞,        (5.2.4)
    |E[g(b, X1)|X0 = x] − E[g(b, X1)|X0 = y]| ≤ C · ‖x − y‖∞.        (5.2.5)

Furthermore, choose wn := n^{−1/(d+2)}. Then

    sup_{g∈G} E sup_{x∈Xn, b∈S} |Zn(g, b, x) − φ(g, b, x)| = o( log^β n / n^{1/(d+2)} )

for any β > 1 and any sequence Xn ⊆ IR^d with diam(Xn) ∈ F.

Now, if the support of fX0, supp fX0 = {x : fX0(x) > 0}, is a compact subset of IR^d, the theorem may be used to derive the following result concerning the estimation of the regression function R. Recall that H denotes the Hausdorff distance (cf. Section 4.3).

Corollary 5.2.2. Under the assumptions of Theorem 5.2.1, if fX0 has compact support and the level sets

    Xn := { x : fX0(x) ≥ 1/n }        (5.2.6)

satisfy

    H( supp fX0, Xn ) ≤ const./n^k    (0 < k ≤ ∞),        (5.2.7)

then

    sup_{g∈G} E sup_{x∈supp fX0, b∈S} |Rn(g, b, x) − R(g, b, x)| = O( log^β n / n^{k/((d+2)(k+1))} )

for any β > 1.

Remark. The additional assumption (5.2.7) is not too restrictive. In particular, if inf_{x∈supp fX0} fX0(x) ≥ const. > 0, we can choose k = ∞ to obtain the "optimal" rate

    sup_{g∈G} E sup_{x∈supp fX0, b∈S} |Rn(g, b, x) − R(g, b, x)| = O( log^β n / n^{1/(d+2)} )

for any β > 1 (for the optimality of rates see, e.g., Györfi et al., 2002).
The proofs of Theorem 5.2.1 and Corollary 5.2.2 refine arguments used in the proofs of Theorems 2.2 and 3.2 in Bosq (1996). We first give the

Proof of Theorem 5.2.1. We write sup_{x,b} instead of sup_{x∈Xn, b∈S}. The estimation error can be decomposed into a stochastic and a deterministic error,

    sup_{x,b} |Zn(g, b, x) − φ(g, b, x)|
    ≤ sup_{x,b} |Zn(g, b, x) − EZn(g, b, x)| + sup_{x,b} |EZn(g, b, x) − φ(g, b, x)|.

1st step: Analysis of the deterministic error sup |EZn − φ|.
Write EZn(g, b, x) as

    EZn(g, b, x) = (1/(n wn^d)) Σ_{i=0}^{n−1} ∫_{IR^d} E[ g(b, Xi+1) K((x − Xi)/wn) | Xi = y ] · fXi(y) dy
                 = ∫_{IR^d} φ(g, b, x − z wn) K(z) dz,

so that, with ∫ K(z) dz = 1,

    sup_{x,b} |EZn(g, b, x) − φ(g, b, x)| ≤ sup_{x,b} ∫_{IR^d} |φ(g, b, x − z wn) − φ(g, b, x)| K(z) dz.

The integrand is bounded by

    |φ(g, b, x − z wn) − φ(g, b, x)|
    ≤ | E[g(b, X1)|X0 = x − z wn] − E[g(b, X1)|X0 = x] | · fX0(x − z wn)
      + | fX0(x − z wn) − fX0(x) | · E[ |g(b, X1)| | X0 = x ]
    ≤ const. · ‖z‖∞ wn,        (5.2.8)

where we have used (5.2.2)–(5.2.5). By assumption, the integral ∫ K(z)‖z‖∞ dz is finite, which yields

    sup_{x,b} |EZn(g, b, x) − φ(g, b, x)| ≤ const. · wn.        (5.2.9)

Note that the constant only depends on K, fX0 and on C in (5.2.2) - (5.2.5),
not however on the specific g ∈ G under consideration.
2nd step: Analysis of the stochastic error sup |Zn − EZn|.
We first cover Xn by ⌈νn⌉^d (νn ≥ 1) cubes

    C(j, n) := { x : ‖x − xj,n‖∞ ≤ diam(Xn)/(2⌈νn⌉) }

with side lengths diam(Xn)/⌈νn⌉ and centres xj,n (j = 1, ..., ⌈νn⌉^d). Analogously, S is covered by ⌈νn⌉^{d′} cubes

    S(k, n) := { b ∈ S : ‖b − bk,n‖∞ ≤ diam(S)/(2⌈νn⌉) }

(k = 1, ..., ⌈νn⌉^{d′}). From this,

    sup_{x,b} |Zn(g, b, x) − EZn(g, b, x)| = sup_{j,k} sup_{x∈C(j,n), b∈S(k,n)} |Zn(g, b, x) − EZn(g, b, x)|
    ≤ sup_{j,k} |Zn(g, bk,n, xj,n) − EZn(g, bk,n, xj,n)|
      + sup_{j,k} sup_{x∈C(j,n), b∈S(k,n)} |Zn(g, b, x) − Zn(g, bk,n, xj,n)|
      + sup_{j,k} sup_{x∈C(j,n), b∈S(k,n)} |EZn(g, b, x) − EZn(g, bk,n, xj,n)|.

On the other hand, using (5.2.2)–(5.2.4),

    |Zn(g, b, x) − Zn(g, bk,n, xj,n)|
    ≤ (1/(n wn^d)) Σ_{i=0}^{n−1} |g(b, Xi+1) − g(bk,n, Xi+1)| · K((x − Xi)/wn)
      + (1/(n wn^d)) Σ_{i=0}^{n−1} |g(bk,n, Xi+1)| · | K((x − Xi)/wn) − K((xj,n − Xi)/wn) |
    ≤ const. · (1/wn^d) · ‖b − bk,n‖∞ + const. · (1/wn^d) · ‖x − xj,n‖∞/wn
    ≤ const. · max(diam(Xn), diam(S)) / (wn^{d+1} νn),

where without loss of generality we have assumed that wn ≤ 1. Again, the constant does not depend on b ∈ S, x ∈ IR^d and g ∈ G.
Using the same argument, we find that

    |EZn(g, b, x) − EZn(g, bk,n, xj,n)| ≤ const. · max(diam(Xn), diam(S)) / (wn^{d+1} νn),
and hence

    sup_{x,b} |Zn(g, b, x) − EZn(g, b, x)|
    ≤ sup_{j,k} |Zn(g, bk,n, xj,n) − EZn(g, bk,n, xj,n)| + 2 · const. · max(diam(Xn), diam(S)) / (wn^{d+1} νn).
This result yields (0 < rn → ∞ will represent the desired rate of convergence at a later stage)

    rn · E sup_{x,b} |Zn(g, b, x) − EZn(g, b, x)|
    ≤ E( rn · sup_{j,k} |Zn(g, bk,n, xj,n) − EZn(g, bk,n, xj,n)| )
      + 2 · const. · rn · max(diam(Xn), diam(S)) / (wn^{d+1} νn)
    = ∫_0^{2‖g‖∞‖K‖∞ rn/wn^d} P( sup_{j,k} |Zn(g, bk,n, xj,n) − EZn(g, bk,n, xj,n)| > ε/rn ) dε
      + 2 · const. · rn · max(diam(Xn), diam(S)) / (wn^{d+1} νn)
    ≤ µ + Σ_{j,k} ∫_µ^{2‖g‖∞‖K‖∞ rn/wn^d} P( |Zn(g, bk,n, xj,n) − EZn(g, bk,n, xj,n)| > ε/rn ) dε
      + 2 · const. · rn · max(diam(Xn), diam(S)) / (wn^{d+1} νn),        (5.2.10)

where µ > 0 is arbitrary.
3rd step: Combining the results of step 1 and step 2.
Assume for the moment that

    P( |Zn(g, bk,n, xj,n) − EZn(g, bk,n, xj,n)| > ε/rn )

has an upper bound pn^{(1)}(ε) + pn^{(2)}(ε) independent of g, b and x with the following four properties:

1. We have

       ∫_µ^∞ pn^{(1)}(ε) dε < ∞.        (5.2.11)

2. For all ε > 0:

       lim_{n→∞} (νn + 1)^{d+d′} pn^{(1)}(ε) = 0.        (5.2.12)

3. For all µ > 0 there exists an N(µ) ∈ IN such that for all ε ≥ µ we have that

       (νn + 1)^{d+d′} pn^{(1)}(ε) is monotonically decreasing for n ≥ N(µ).        (5.2.13)

4. We have

       lim_{n→∞} ∫_µ^{2‖g‖∞‖K‖∞ rn/wn^d} (νn + 1)^{d+d′} pn^{(2)}(ε) dε = 0.        (5.2.14)

We then infer from (5.2.9) and (5.2.10) that

    lim sup_{n→∞} sup_{g∈G} rn · E sup_{x∈Xn, b∈S} |Zn(g, b, x) − φ(g, b, x)|
    ≤ µ + lim sup_{n→∞} (νn + 1)^{d+d′} ∫_µ^∞ pn^{(1)}(ε) dε
      + lim sup_{n→∞} ∫_µ^{2‖g‖∞‖K‖∞ rn/wn^d} (νn + 1)^{d+d′} pn^{(2)}(ε) dε
      + lim sup_{n→∞} const. · rn · ( wn + max(diam(Xn), diam(S)) / (wn^{d+1} νn) ).

The second term is zero using (5.2.11)–(5.2.13) and the monotone convergence theorem, the third term is zero because of (5.2.14). µ being arbitrary we obtain

    lim sup_{n→∞} sup_{g∈G} rn · E sup_{x,b} |Zn(g, b, x) − φ(g, b, x)|
    ≤ lim sup_{n→∞} const. · rn · ( wn + max(diam(Xn), diam(S)) / (wn^{d+1} νn) ),        (5.2.15)

from which we shall determine rn.
4th step: Finding a bound pn^{(1)}(ε) + pn^{(2)}(ε) for the 3rd step.
To this end let

    Wi,n := Wi,n(b, x) := (1/wn^d) [ g(b, Xi+1) K((x − Xi)/wn) − E g(b, Xi+1) K((x − Xi)/wn) ].

Simple calculations yield

    Var Wi,n ≤ ‖K‖∞ ‖fX0‖∞ ‖g‖∞² / wn^d,
    |Cov(Wi,n, Wj,n)| ≤ ‖fX0‖∞ ‖g‖∞² ( ‖K‖∞/wn^d + ‖fX0‖∞ ),

and we have

    Zn(g, b, x) − EZn(g, b, x) = (1/n) Σ_{i=0}^{n−1} Wi,n.        (5.2.16)
The property of {Xi }i being a stationary GSM-process is inherited by {Wi,n }i
whose kth α-mixing coefficient is less than α(k − 1), the (k − 1)st α-mixing
coefficient of {Xi}i (independent of b ∈ S, x ∈ IRd ). To bound the tails of
Zn − EZn , we exploit the expansion (5.2.16). We also use Theorem 1.3 in Bosq
(1996) which states a tail inequality for empirical means of centered random
variables in terms of their α-mixing coefficients:

Proposition 5.2.3. (Bosq, 1996, Theorem 1.3) Let $\{Y_i\}_{i=-\infty}^{\infty}$ be a centered real-valued stochastic process with $\sup_{1\le i\le n}\|Y_i\|_\infty \le D$. Then for any $q \in [1, n/2]$ and any $\epsilon > 0$
\[
P\left( \left| \frac{1}{n}\sum_{i=0}^{n-1} Y_i \right| > \epsilon \right)
\le 4 \exp\left( - \frac{\epsilon^2 q}{8 v^2} \right)
+ 22\left( 1 + \frac{4D}{\epsilon} \right)^{1/2} \lceil q \rceil\, \alpha\!\left( \left\lceil \frac{n}{2q} \right\rceil \right),
\]
where
\[
p := \frac{n}{2q}, \qquad
v^2 := \frac{2}{p^2}\,\sigma(q)^2 + \frac{D\epsilon}{2},
\]
\[
\sigma^2(q) := \max_{0 \le j \le 2\lceil q\rceil - 1} E\Big( \big(\lfloor jp\rfloor + 1 - jp\big) Y_{\lfloor jp\rfloor+1} + Y_{\lfloor jp\rfloor+2} + \ldots + Y_{\lfloor (j+1)p\rfloor} + \big( (j+1)p - \lfloor (j+1)p\rfloor \big) Y_{\lfloor (j+1)p\rfloor+1} \Big)^2.
\]

The proposition will be applied to the centered GSM-process $\{W_{i,n}\}_i$. Multiplying out the sum defining $\sigma^2(q)$ we obtain at most $(p+2)^2$ terms. The above variance and covariance bounds for $W_{i,n}$ then yield that for any $q = q_n \in [1, n/2]$
\[
\frac{2}{p^2}\,\sigma^2(q) \le \frac{\mathrm{const.}}{w_n^d},
\]
where the constant depends on nothing but $K$, $f_{X_0}$ and $C$ from (5.2.2)-(5.2.3). Hence (set $D := C\,\|K\|_\infty w_n^{-d}$)
\[
P\left( |Z_n(g,b,x) - E Z_n(g,b,x)| > \frac{\epsilon}{r_n} \right) \le p_n^{(1)}(\epsilon) + p_n^{(2)}(\epsilon)
\]
with (const. being another suitable constant)
\[
p_n^{(1)}(\epsilon) := 4 \exp\left( - \frac{\epsilon^2 q_n w_n^d}{\mathrm{const.}\cdot(1+\epsilon)\, r_n^2} \right),
\qquad
p_n^{(2)}(\epsilon) := 22\left( 1 + \frac{\mathrm{const.}\cdot r_n}{\epsilon\, w_n^d} \right)^{1/2} \lceil q_n\rceil\, \alpha\!\left( \left\lceil \frac{n}{2q_n} \right\rceil - 1 \right).
\]

5th step: We can now move on to finding appropriate $w_n$, $r_n$, $\nu_n$ in (5.2.15). The crux is to satisfy (5.2.11)-(5.2.14) with the above $p_n^{(1)}$ and $p_n^{(2)}$. (5.2.11) is fulfilled because of $\int_\mu^\infty \exp(-\epsilon)\,d\epsilon < \infty$. Elementary calculations show that (5.2.12) and (5.2.13) are satisfied if only
\[
\frac{q_n w_n^d}{r_n^2} \in F \quad\text{and}\quad \nu_n^{d+d'} \in F \tag{5.2.17}
\]
(observe that $\{a_n\}, \{b_n\} \in F$ and $0 \le \rho < 1$ implies $a_n \rho^{b_n} \searrow 0$). As to (5.2.14) we use
\[
\begin{aligned}
\int_\mu^{2\|g\|_\infty\|K\|_\infty r_n/w_n^d} (\nu_n+1)^{d+d'}\, p_n^{(2)}(\epsilon)\,d\epsilon
&\le 22 \cdot (\nu_n+1)^{d+d'} \cdot \lceil q_n\rceil\, \alpha\!\left( \left\lceil \frac{n}{2q_n}\right\rceil - 1 \right) \int_0^{2\|g\|_\infty\|K\|_\infty r_n/w_n^d} \left( 1 + \frac{\mathrm{const.}\cdot r_n}{\epsilon\, w_n^d} \right)^{1/2} d\epsilon \\
&\le \mathrm{const.}\cdot (\nu_n+1)^{d+d'} \cdot \lceil q_n\rceil\, \alpha\!\left( \left\lceil \frac{n}{2q_n}\right\rceil - 1 \right) \cdot \frac{r_n}{w_n^d}.
\end{aligned}
\]
Thus (5.2.14) is satisfied if only
\[
\lim_{n\to\infty} \frac{r_n\, \nu_n^{d+d'}}{w_n^d}\, \lceil q_n\rceil\, \alpha\!\left( \left\lceil \frac{n}{2q_n}\right\rceil - 1 \right) = 0. \tag{5.2.18}
\]
So it suffices to satisfy (5.2.17) and (5.2.18). This is done by the choice
\[
r_n := \frac{1}{w_n \log^\beta n} \quad (\beta > 1), \qquad
\nu_n := \frac{\mathrm{diam}(S) + \mathrm{diam}(\mathcal{X}_n)}{w_n^{d+2}}, \qquad
w_n := \frac{1}{n^{1/(d+2)}}, \qquad
q_n := \frac{n}{\log^a n} \quad (2\beta - 1 > a > 1).
\]
Indeed, from $q_n w_n^d / r_n^2 = \log^{2\beta - a} n \in F$ and $\mathrm{diam}(\mathcal{X}_n) \in F$ we have (5.2.17). Moreover,
\[
\frac{r_n\, \nu_n^{d+d'}}{w_n^d}\, \lceil q_n\rceil\, \alpha\!\left( \left\lceil \frac{n}{2q_n}\right\rceil - 1 \right)
\le \mathrm{const.}\cdot \left( \mathrm{diam}(S) + \mathrm{diam}(\mathcal{X}_n) \right)^{d+d'} \cdot \frac{n^{1+d+d'+(1+d)/(2+d)}}{\log^{a+\beta} n} \cdot \alpha\!\left( \left\lceil \frac{\log^a n}{2}\right\rceil - 1 \right),
\]
and the GSM-property yields (5.2.18) (observe again that $\{a_n\}, \{b_n\} \in F$ and $0 \le \rho < 1$ implies $a_n \rho^{b_n} \searrow 0$).
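The identity $q_n w_n^d / r_n^2 = \log^{2\beta-a} n$ used above is quick to spot-check numerically. A minimal sketch (the parameter values $d = 3$, $\beta = 2$, $a = 1.5$ are illustrative choices satisfying $\beta > 1$ and $2\beta - 1 > a > 1$):

```python
import math

def rate_identity(n, d=3, beta=2.0, a=1.5):
    # the sequence choices from the 5th step of the proof
    w = n ** (-1.0 / (d + 2))             # w_n = n^{-1/(d+2)}
    r = 1.0 / (w * math.log(n) ** beta)   # r_n = 1/(w_n log^beta n)
    q = n / math.log(n) ** a              # q_n = n / log^a n
    lhs = q * w ** d / r ** 2             # q_n w_n^d / r_n^2
    rhs = math.log(n) ** (2 * beta - a)   # claimed value log^{2beta-a} n
    return lhs, rhs
```

For, e.g., $n = 10^6$ the two sides agree up to floating-point rounding.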
Finally, (5.2.15) now reads
\[
\limsup_{n\to\infty}\, \sup_{g\in G}\, r_n \cdot E \sup_{x,b} |Z_n(g,b,x) - \phi(g,b,x)| \le \limsup_{n\to\infty} \mathrm{const.}\cdot \frac{1}{\log^\beta n} = 0. \qquad \Box
\]

The proof of Corollary 5.2.2 is more straightforward.

Proof of Corollary 5.2.2. Set $\mathcal{X} = \mathrm{supp}\, f_{X_0}$ and $\mathcal{X}'_n := \mathcal{X}_{n^\gamma}$, where $\gamma > 0$ is adjusted later. We write $\sup_{x,b}$ instead of $\sup_{x\in\mathcal{X},\, b\in S}$. Clearly,
\[
\begin{aligned}
\sup_{g\in G} E \sup_{x,b} |R_n(g,b,x) - R(g,b,x)|
&\le \sup_{g\in G} E \sup_{x\in\mathcal{X}'_n,\, b\in S} |R_n(g,b,x) - R(g,b,x)| \\
&\quad + \sup_{g\in G} E \sup_{x,b}\, \inf_{x^*\in\mathcal{X}'_n} |R_n(g,b,x) - R_n(g,b,x^*)| \\
&\quad + \sup_{g\in G} E \sup_{x,b}\, \inf_{x^*\in\mathcal{X}'_n} |R(g,b,x) - R(g,b,x^*)|.
\end{aligned}
\]
Condition (5.2.7) implies
\[
\sup_{x\in\mathcal{X}}\, \inf_{x^*\in\mathcal{X}'_n} \|x - x^*\|_\infty \le \mathrm{const.}\cdot n^{-k\gamma}.
\]
Using (5.2.2)-(5.2.5), we can bound the second and the third term from above by
\[
\mathrm{const.}\cdot \sup_{x\in\mathcal{X}}\, \inf_{x^*\in\mathcal{X}'_n} \|x - x^*\|_\infty \le \mathrm{const.}\cdot n^{-k\gamma}.
\]

Using Theorem 5.2.1, the first term satisfies
\[
\begin{aligned}
&\sup_{g\in G} E \sup_{x\in\mathcal{X}'_n,\, b\in S} |R_n(g,b,x) - R(g,b,x)| \\
&= \sup_{g\in G} E \sup_{x\in\mathcal{X}'_n,\, b\in S} \left| R_n(g,b,x) - \frac{Z_n(g,b,x)}{f_{X_0}(x)} + \frac{Z_n(g,b,x)}{f_{X_0}(x)} - \frac{\phi(g,b,x)}{f_{X_0}(x)} \right| \\
&\le \sup_{g\in G} E\, \frac{\sup_{x\in\mathcal{X}'_n,\, b\in S} |R_n(g,b,x)|}{\inf_{x\in\mathcal{X}'_n} f_{X_0}(x)} \cdot \sup_{x\in\mathcal{X}'_n,\, b\in S} |f_{X_0}(x) - Z_n(1,b,x)| \\
&\quad + \frac{1}{\inf_{x\in\mathcal{X}'_n} f_{X_0}(x)}\, \sup_{g\in G} E \sup_{x\in\mathcal{X}'_n,\, b\in S} |Z_n(g,b,x) - \phi(g,b,x)| \\
&\le \mathrm{const.}\cdot n^\gamma\, \frac{\log^\beta n}{n^{1/(d+2)}}.
\end{aligned}
\]
Consequently,
\[
\sup_{g\in G} E \sup_{x,b} |R_n(g,b,x) - R(g,b,x)|
= O\!\left( \frac{\log^\beta n}{n^{1/(d+2)-\gamma}} + \frac{1}{n^{k\gamma}} \right)
= O\!\left( \frac{\log^\beta n}{n^{k/((d+2)(k+1))}} \right),
\]
the latter equality holding for the balanced choice $\gamma = 1/((d+2)(k+1))$. $\Box$

5.3 Proving the optimality of the strategy


With the results of the previous section, we are in a position to give the
Proof of Theorem 5.1.1. Analyzing the strategy $\{\hat b_i\}_i$, our first task is to derive an inequality analogous to (4.2.12), with $\hat h_i$ in place of $h_i$ and $\hat b_i$ in place of $b_i^*$.
From the empirical Bellman equation, we find that
\[
\begin{aligned}
\hat h_{i+1}(b_i, \bar X_{i+1})
&= \max_{b\in S(b_i, \bar X_{i+1})} \sum_{j=-d+1}^{i} \left( f(b, \bar X_{j+1}) + (1-\delta_{i+1})\,\hat h_{i+1}(b, \bar X_{j+1}) \right) K_{i+1}(\bar X_j, \bar X_{i+1}) \\
&\ge \sum_{j=-d+1}^{i} \left( f(b_{i+1}, \bar X_{j+1}) + (1-\delta_{i+1})\,\hat h_{i+1}(b_{i+1}, \bar X_{j+1}) \right) K_{i+1}(\bar X_j, \bar X_{i+1})
\end{aligned}
\]
for any admissible portfolio strategy $\{b_i\}_i$. Equality holds for $\{\hat b_i\}_i$, which yields
\[
\begin{aligned}
&m(\hat b_{i+1}, \bar X_{i+1}) - m(b_{i+1}, \bar X_{i+1}) \\
&\ge H_i^{(1)}(b_{i+1}) + H_i^{(2)}(b_{i+1}) - \hat h_{i+1}(b_i, \bar X_{i+1}) + (1-\delta_{i+1})\, E[\hat h_{i+1}(b_{i+1}, \bar X_{i+2}) \mid \mathcal{F}_{i+1}] \\
&\quad - H_i^{(1)}(\hat b_{i+1}) - H_i^{(2)}(\hat b_{i+1}) + \hat h_{i+1}(\hat b_i, \bar X_{i+1}) - (1-\delta_{i+1})\, E[\hat h_{i+1}(\hat b_{i+1}, \bar X_{i+2}) \mid \mathcal{F}_{i+1}]
\end{aligned}
\tag{5.3.1}
\]

with
\[
H_i^{(1)}(b) := \sum_{j=-d+1}^{i} f(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar X_{i+1}) - m(b, \bar X_{i+1}),
\]
and
\[
H_i^{(2)}(b) := (1-\delta_{i+1}) \left( \sum_{j=-d+1}^{i} \hat h_{i+1}(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar X_{i+1}) - E[\hat h_{i+1}(b, \bar X_{i+2}) \mid \mathcal{F}_{i+1}] \right).
\]

Now, we investigate the asymptotics of the terms $H_i^{(1)}$ and $H_i^{(2)}$ in (5.3.1). Clearly,
\[
\sup_{b\in S} |H_i^{(1)}(b)| \le \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i} f(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar x) - m(b, \bar x) \right|. \tag{5.3.2}
\]
To analyse $H_i^{(2)}$ we define an $\mathcal{F}_i$-measurable random variable $c_i$,
\[
c_i := \arg\min_{c\in\mathbb{R}} \|\hat h_i - h_i + c\|_\infty,
\]
and obtain
\[
\begin{aligned}
\sup_{b\in S} |H_i^{(2)}(b)|
&\le \sup_{b\in S} \left| \sum_{j=-d+1}^{i} h_{i+1}(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar X_{i+1}) - E[h_{i+1}(b, \bar X_{i+2}) \mid \mathcal{F}_{i+1}] \right| \\
&\quad + \left| \sum_{j=-d+1}^{i} \big( \hat h_{i+1}(...) - h_{i+1}(...) + c_{i+1} \big) K_{i+1}(...) - E[\hat h_{i+1}(...) - h_{i+1}(...) + c_{i+1} \mid \mathcal{F}_{i+1}] \right| \\
&\le \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i} h_{i+1}(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar x) - E[h_{i+1}(b, \bar X_{i+2}) \mid \bar X_{i+1} = \bar x] \right| + 2\,\|\hat h_{i+1} - h_{i+1} + c_{i+1}\|_\infty \\
&\le \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i} h_{i+1}(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar x) - E[h_{i+1}(b, \bar X_{i+2}) \mid \bar X_{i+1} = \bar x] \right| + \|\hat h_{i+1} - h_{i+1}\|.
\end{aligned}
\tag{5.3.3}
\]

For this, recall the norm $\|\cdot\| = 2\inf_{c\in\mathbb{R}} \|\cdot + c\|_\infty$ on $C(S\times[A,B]^{dm})$. By the contraction property of $\hat M_i$,
\[
\begin{aligned}
\|\hat h_i - h_i\| &= \|\hat M_i \hat h_i - M_i h_i\| \\
&\le \|\hat M_i \hat h_i - \hat M_i h_i\| + \|\hat M_i h_i - M_i h_i\| \\
&\le (1-\delta_i)\,\|\hat h_i - h_i\| + 2\,\|\hat M_i h_i - M_i h_i\|_\infty,
\end{aligned}
\]
and hence
\[
\|\hat h_i - h_i\| \le \frac{2}{\delta_i}\, \|\hat M_i h_i - M_i h_i\|_\infty. \tag{5.3.4}
\]
It is easily established that
\[
\begin{aligned}
\|\hat M_i h_i - M_i h_i\|_\infty
&\le \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i-1} f(b, \bar X_{j+1})\, K_i(\bar X_j, \bar x) - m(b, \bar x) \right| \\
&\quad + (1-\delta_i) \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i-1} h_i(b, \bar X_{j+1})\, K_i(\bar X_j, \bar x) - E[h_i(b, \bar X_{d+1}) \mid \bar X_d = \bar x] \right|.
\end{aligned}
\tag{5.3.5}
\]

As a consequence of (5.3.2)-(5.3.5) we obtain
\[
\begin{aligned}
E \sup_{b\in S} \left| H_i^{(1)}(b) + H_i^{(2)}(b) \right|
&\le \left( 1 + \frac{2}{\delta_{i+1}} \right) E \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i} f(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar x) - m(b, \bar x) \right| \\
&\quad + \left( 1 + \frac{2}{\delta_{i+1}} \right) (1-\delta_{i+1})\, E \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i} h_{i+1}(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar x) - E[h_{i+1}(b, \bar X_{i+2}) \mid \bar X_{i+1} = \bar x] \right| \\
&\le \frac{\mathrm{const.}}{\delta_{i+1}^2}\, \sup_{g\in G} E \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i} g(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar x) - E[g(b, \bar X_{i+2}) \mid \bar X_{i+1} = \bar x] \right|,
\end{aligned}
\]
and under the assumption (5.1.6) of the theorem we find that
\[
E \sup_{b\in S} \left| H_i^{(1)}(b) + H_i^{(2)}(b) \right| \to 0 \quad (i\to\infty).
\]

Calculating expectations, summing $\frac{1}{n}\sum_{i=-1}^{n-2}\ldots$ and taking $\liminf_{n\to\infty}$ in (5.3.1), we end up with
\[
\begin{aligned}
&\liminf_{n\to\infty} \left( E\left( \frac{1}{n}\sum_{i=0}^{n-1} m(\hat b_i, \bar X_i) \right) - E\left( \frac{1}{n}\sum_{i=0}^{n-1} m(\tilde b_i, \bar X_i) \right) \right) \\
&\ge \liminf_{n\to\infty} \bigg\{ E\, \frac{1}{n}\sum_{i=-1}^{n-2} \left( \hat h_{i+1}(\tilde b_{i+1}, \bar X_{i+2}) - \hat h_{i+1}(\tilde b_i, \bar X_{i+1}) \right) \\
&\qquad\qquad - E\, \frac{1}{n}\sum_{i=-1}^{n-2} \left( \hat h_{i+1}(\hat b_{i+1}, \bar X_{i+2}) - \hat h_{i+1}(\hat b_i, \bar X_{i+1}) \right) \\
&\qquad\qquad + E\, \frac{1}{n}\sum_{i=0}^{n-1} \delta_i \left( \hat h_i(\hat b_i, \bar X_{i+1}) - \hat h_i(\tilde b_i, \bar X_{i+1}) \right) \bigg\} \tag{5.3.6} \\
&=: \liminf_{n\to\infty} \{ D_i \},
\end{aligned}
\]

{b̃i }i being the periodic strategy from Lemma 4.2.6. This is the analogue of
(4.2.12) we were looking for.
Finally, we observe that
\[
\begin{aligned}
\hat h_{i+1}(b_{i+1}, \bar X_{i+2}) - \hat h_{i+1}(b_i, \bar X_{i+1})
&= h_{i+1}(b_{i+1}, \bar X_{i+2}) - h_{i+1}(b_i, \bar X_{i+1}) \\
&\quad + \left( \hat h_{i+1}(b_{i+1}, \bar X_{i+2}) - h_{i+1}(b_{i+1}, \bar X_{i+2}) + c_{i+1} \right) \\
&\quad + \left( h_{i+1}(b_i, \bar X_{i+1}) - \hat h_{i+1}(b_i, \bar X_{i+1}) - c_{i+1} \right),
\end{aligned}
\tag{5.3.7}
\]
where the expectation of the absolute value of the sum of the last two brackets is bounded from above by
\[
2\, E\|\hat h_{i+1} - h_{i+1} + c_{i+1}\|_\infty = E\|\hat h_{i+1} - h_{i+1}\| \to 0 \quad (i\to\infty), \tag{5.3.8}
\]

using (5.3.4), (5.3.5) and the assumption (5.1.6) of the theorem. (5.3.7) and (5.3.8) show that, for the purpose of the asymptotic inference in (5.3.6), we can replace $\hat h_{i+1}$ by $h_{i+1}$ in the definition of $D_i$. We can then argue exactly as in the proof of Theorem 4.2.1 (starting from (4.2.12)) to obtain the optimality relation
\[
\liminf_{n\to\infty} \left( E\, \frac{1}{n}\sum_{i=0}^{n-1} m(\hat b_i, \bar X_i) - E\, \frac{1}{n}\sum_{i=0}^{n-1} m(b_i, \bar X_i) \right) \ge 0. \qquad \Box
\]

It remains to prove Theorem 5.1.2:


Proof of Theorem 5.1.2. This will be done by application of Corollary 5.2.2
for the process {X̄i }i . Clearly, the GSM-property of {Xi }i makes {X̄i }i a GSM-
process, too.
Set $G := \{\delta_i \cdot h_i,\ \delta_i \cdot f \mid i = 1, 2, ...\}$. Lemma 4.2.3 implies that $\|g\|_\infty \le \|f\|_\infty$ for all $g \in G$. By assumption 3 of the theorem the requirements of Proposition 4.3.1 are met, so that we can find a constant $C > 0$ with
\[
|g(s, \bar x) - g(t, \bar x)| \le C\, \|s - t\|_\infty \tag{5.3.9}
\]
for all $g = \delta_i \cdot h_i \in G$, $\bar x \in [A,B]^{dm}$ and $s, t \in S$. Increasing $C$ to at least the Lipschitz constant of $f$, (5.3.9) holds for $\delta_i \cdot f$ as well. Thus, the conditions (5.2.2)-(5.2.4) of Theorem 5.2.1 are fulfilled for the class $G$. (5.2.5) holds because $f_{X_0|\bar X_0}$ is Lipschitz continuous, and we conclude
\[
\frac{1}{\delta_{i+1}^2}\, \sup_{g\in G} E \sup_{b\in S,\, \bar x\in[A,B]^{dm}} \left| \sum_{j=-d+1}^{i} g(b, \bar X_{j+1})\, K_{i+1}(\bar X_j, \bar x) - E[g(b, \bar X_{i+2}) \mid \bar X_{i+1} = \bar x] \right| \le \mathrm{const.}\cdot \frac{1}{\delta_{i+1}^2} \cdot \frac{\log i}{i^\alpha}
\]
for some $\alpha > 0$. By assumption 4 on $\{\delta_i\}_i$ in Theorem 5.1.2, we find that
\[
\frac{1}{\delta_{i+1}^2} \cdot \frac{\log i}{i^\alpha} \longrightarrow 0 \quad (i\to\infty),
\]
which finally yields the assertion. $\Box$



CHAPTER 6

Portfolio selection functions in stationary return processes
In Chapter 5 we considered d-stage Markov processes in which portfolio selec-
tion could be done on the basis of the returns on the last d market days. By
the Markov property, the d-past completely characterized the stochastic regime
of the next market day. However, more general return processes {Xi }i , such as
merely stationary and ergodic processes will fail to have the Markov property.
Then, in principle, the investor is forced to evaluate the conditional log-optimal
portfolio b∗ (Xn , ..., X0) given the past returns X0 , ..., Xn starting from day zero.
As we will explain in Section 6.1, this is not always feasible. Choosing a port-
folio as a function of the d-past works well in Markov return processes. Hence
it is a natural modification of the conditional log-optimal strategy to consider log-optimal portfolio selection functions $f_{opt}(\cdot) : \mathbb{R}^{dm}_+ \to S$ in stationary ergodic return processes as well. These choose a portfolio from the portfolio simplex $S$
on the basis of the asset returns during the last d market periods (Section 6.1).
Log-optimal portfolio selection functions can only be calculated if one happens
to know the underlying return distribution. Otherwise the investor has to rely
on estimates.

Section 6.2 describes an estimator ZnL(·) for a log-optimal portfolio selection


function, with strong consistency results given in Lemma 6.2.1 and Theorem
6.2.2. The estimator works sequentially: the return data of the stocks is in-
cluded in the estimation process as it emerges. The central question is how
an investor using the estimated log-optimal portfolio selection function ZnL(·)
competes with other investors using different portfolio selection functions. As
we will see, repeated investment according to the estimate is optimal among all
investment strategies based on portfolio selection functions of the last d mar-
ket periods. In particular, it performs no worse than the unknown log-optimal

portfolio selection function fopt itself (Corollary 6.2.3).


In all this, L > 0 is a parameter of the underlying stock market characterizing
its regularity properties beyond stationarity and ergodicity. L is unknown to
the investor. To avoid this drawback in the cases relevant for practical applica-
tion, an adaptive estimator Zn (·) is constructed that does not require explicit
knowledge of L. This estimator features the same convergence properties as
ZnL (·) (Theorem 6.2.4), making it most appealing for application.
In Section 6.3 we prove the results of the preceding sections, and the chapter is
concluded by several simulations and examples (Section 6.4).

6.1 Portfolio selection functions


Let $T \in \mathbb{N}$ be fixed and $\{X_i\}_{i=-T}^{\infty}$ an $[a,b]^m$-valued ($0 < a \le b < \infty$), stationary and ergodic process of return vectors in a market of $m$ shares. As usual, at time $i$, the return process up to and including $X_i$ has been observed.
It is natural for the investor to choose his investment portfolio on the basis of recently observed returns, say on the basis of the last $d \in \mathbb{N}$ market periods ($d \le T$ fixed throughout). If investment performance is assessed on the basis of logarithmic utility, the investor's aim is to find a log-optimal portfolio selection function of the d-past, i.e., a measurable function
\[
b : [a,b]^{dm} \longrightarrow S := \Big\{ s \in \mathbb{R}^m : \sum_{j=1}^{m} s_j = 1,\ s_j \ge 0 \Big\}
\]
such that ($\langle\cdot,\cdot\rangle$ denoting the Euclidean scalar product)
\[
E\big( \log \langle b(X_{-d}, ..., X_{-1}), X_0 \rangle \big) \ge E\big( \log \langle f(X_{-d}, ..., X_{-1}), X_0 \rangle \big) \tag{6.1.1}
\]
for all measurable $f : [a,b]^{dm} \longrightarrow S$. At time $n \in \mathbb{N}_0$, $b$ advises the investor to allocate his wealth to the single shares according to the portfolio $b(X_{n-d+1}, ..., X_n) = b(\bar X_{n+1})$, where $\bar X_n$ is a shorthand notation for the d-past $(X_{n-d}, ..., X_{n-1}) \in \mathbb{R}^{dm}_+$ of $X_n$.
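The indexing of the d-past is easy to mistranscribe, so here is a minimal sketch of the bookkeeping (plain Python, illustrative only; the constant selection function `f_const` is a hypothetical stand-in for a measurable $b$):

```python
import math

def d_past(returns, n, d):
    # X_bar_{n+1} = (X_{n+1-d}, ..., X_n): the returns of the last d market periods
    return tuple(returns[n + 1 - d : n + 1])

def log_return(b, returns, n, d):
    # one-period logarithmic return log <b(X_bar_{n+1}), X_{n+1}>
    portfolio = b(d_past(returns, n, d))
    x_next = returns[n + 1]
    return math.log(sum(p * x for p, x in zip(portfolio, x_next)))

# toy market with m = 2 assets and d = 1
returns = [(1.0, 1.1), (0.9, 1.2), (1.05, 0.95)]
f_const = lambda past: (0.5, 0.5)  # hypothetical constant selection function
```

With these toy numbers the uniform portfolio earns log-return $\log(0.5\cdot 1.05 + 0.5\cdot 0.95) = \log 1 = 0$ in the last period.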

In contrast to the conditional log-optimal portfolio b∗ , which is a function of


all past return data, we only include the last d observations. This brings about
several advantages:

– It is plausible that one should drop observations from the far-away past if
the stationarity of the market is not clear. Outdated observations (under
non-stationarity they appear to be drawn from a “wrong” distribution)
may endanger the performance of the conditional log-optimal portfolio.
We conjecture that portfolio selection functions are less sensitive to devi-
ations from stationarity of the return process. Finding empirical evidence
or counterevidence for this, however, is beyond the scope of this thesis.

– Each market has a specific log-optimal portfolio selection function, which


always remains the same in stationary markets. Log-optimal portfolio se-
lection functions are therefore much easier to interpret than conditional
log-optimal portfolios (which are not much of a single market character-
istic, but a sequence of functions). The shape of a log-optimal portfolio
selection function allows us to find structures in the stock quote chart
that should be interpreted as “buy”, “hold” and “sell” signals for the single
stocks. This makes log-optimal portfolio selection functions a theoreti-
cally well-founded counterpart of heuristic chart analysis (as presented,
e.g., in Möller, 1998).

– As already noted in Chapter 1, estimation of b∗ from market data is highly


problematic. As we shall see, estimation of log-optimal portfolio selection
functions, however, can exploit recent, powerful nonparametric regression
estimation algorithms in stationary ergodic processes.

In order to find a log-optimal portfolio selection function, we observe that
\[
E\big( \log \langle b(X_{-d}, ..., X_{-1}), X_0 \rangle \big) = \int E\big( \log \langle b(\bar x), X_0 \rangle \mid \bar X_0 = \bar x \big)\, P_{\bar X_0}(d\bar x).
\]
Hence, it suffices to consider pointwise maximization of
\[
R(s, \bar x) := E\big( \log \langle s, X_0 \rangle \mid \bar X_0 = \bar x \big)
\]
for fixed $\bar x$. Here and in the sequel, the quantities $s$ and $\bar x$ are to be implicitly understood as $s \in S$ and $\bar x \in \mathbb{R}^{dm}_+$.

Let $KT(\bar x) \subseteq S$ denote the set of Kuhn-Tucker points (cf. Foulds, 1981), i.e. the set of solutions of the convex maximization problem
\[
R(s, \bar x) \longrightarrow \max_{s \in S}! \tag{6.1.2}
\]
Because of the continuity of $R(\cdot, \bar x)$ and the compactness of $S$, $KT(\bar x) \ne \emptyset$, and the existence of solutions to (6.1.2) as well as to (6.1.1) is guaranteed.
\[
R^*(\bar x) := \max_{s\in S} R(s, \bar x)
\]
is the maximum of the target function and
\[
R_{\max} := \int R^*(\bar x)\, P_{\bar X_0}(d\bar x) = \max_{b:[a,b]^{dm}\to S} E \log \langle b(\bar X_0), X_0 \rangle
\]
denotes the maximal expected logarithmic return.


To solve this maximization problem with historical return data (the return
distribution being unknown), Walk and Yakowitz (1999) and Walk (2000) have
suggested recursive estimation of log-optimal portfolio selection functions. For
this, we will use a nonparametric, strongly consistent regression estimation
scheme (i.e., with probability one, the estimates converge to the true regression
function in the pointwise sense). For a detailed overview of nonparametric
regression estimation (for non-i.i.d., i.e. dependent data), we refer the reader
to Bosq (1996), Härdle (1990) and Härdle et al. (1998).
Until recently, for non-i.i.d. data, only nonparametric regression estimators
were available for which strong consistency was linked up with appropriate
mixing conditions (Györfi, Härdle et al., 1989, and Bosq, 1996). These ensure a
suitable decay of dependency in the data. Others made regularity assumptions
on conditional densities of the process variables (Laib, 1999, Laib and Ould-
Said, 1996). On the other hand, the examples in Györfi et al. (1998) show
that we actually have to impose some regularity conditions on the process in
order to be able to obtain strong consistency. Correspondingly, the consistency
proofs for the estimated log-optimal portfolio selection functions in Walk and Yakowitz (1999) and Walk (2000) also rely on mixing conditions.
However, mixing conditions can hardly be verified from observational data using some statistical testing procedure. Now, with the work of Yakowitz, Györfi et al. (1999) an algorithm has been proposed that achieves strong consistency under conditions other than mixing. The mixing requirement, ensuring a suitable decay of dependency in the data, was replaced by a condition on the finite dimensional (more precisely the d-dimensional) distribution of the process, namely a Lipschitz condition on the regression function. In particular, processes featuring long-range dependence (which must be expected in financial data, see e.g. Ding et al., 1993; Peters, 1997) are not precluded from consideration as they would have been under mixing conditions.
In this chapter, the estimator of Yakowitz, Györfi et al. (1999) is combined
with a stochastic projection algorithm (Kushner and Clark, 1978) to obtain
a strongly consistent sequential estimator for a log-optimal portfolio selection
function of the d-past of the return process. The mixing conditions in Walk
and Yakowitz (1999) and Walk (2000) are replaced by a Lipschitz condition on
the gradient of the target function.

6.2 Estimation of log-optimal portfolio selection functions

Throughout this chapter we assume that the following regularity conditions V1 and V2 hold:

V1: $\{X_i\}_{i=-T}^{\infty}$ ($T \in \mathbb{N}$) is an $[a,b]^m$-valued stationary ergodic stochastic process on a probability space $(\Omega, \mathcal{A}, P)$ ($0 < a \le b < \infty$ need not be known explicitly). Some $d \in \mathbb{N}$ ($d \le T$) is fixed.

V2: The gradient of the target function $R(s, \bar x)$,
\[
m(s, \bar x) := E\left( \frac{X_0}{\langle s, X_0 \rangle} \,\Big|\, \bar X_0 = \bar x \right)
\]
(which we already know from the Kuhn-Tucker conditions in Theorem 1.3.3), is a Lipschitz continuous function of $\bar x$ with Lipschitz constant $L/\sqrt{md}$, i.e.
\[
|m(s, \bar x) - m(s, \bar y)| \le \frac{L}{\sqrt{md}}\, |\bar x - \bar y| \quad \text{for all } \bar x, \bar y \in \mathbb{R}^{dm}_+,\ s \in S.
\]

Condition V2 is fulfilled if the conditional distribution $P_{X_0|\bar X_0}$ has a density $f_{X_0|\bar X_0}(x_0, \bar x)$ with respect to some measure $\mu$, such that
\[
|f_{X_0|\bar X_0}(x_0, \bar x) - f_{X_0|\bar X_0}(x_0, \bar y)| \le \frac{L\, a}{\sqrt{md} \cdot b\, \mu([a,b]^m)}\, |\bar x - \bar y|
\]
(note the similarity to the Lipschitz conditions of Theorem 5.1.2). In particular, this holds if $f_{X_0|\bar X_0}$ is continuously differentiable. Hence, V2 is a condition on the variability of the return vectors and as such a condition on the risk inherent in the market.
At time n, the investor’s task is to produce an estimate ZnL (x̄) of the value of a
log-optimal portfolio selection function given the last d observed return vectors
are x̄ ∈ IRdm . This can be done by the following projection algorithm:

1. Before we start the estimation process, we fix a partition Pk of IRdm + into


cubes of volume (2 ) for each positive integer k. For x̄ ∈ IRdm
−k−2 dm
+ the
element of Pk containing x̄ is denoted by Ak (x̄). We also fix some sequence
αn > 0 (n ∈ IN).

2. Then, at time $n$, we calculate a partitioning regression estimate of the gradient $m(s, \bar x)$ of the target function (Yakowitz, Györfi et al., 1999): More precisely, for $N_n \in \mathbb{N}$ with $\lim_{n\to\infty} N_n = \infty$ and $\bar X_j := (X_{j-d}, ..., X_{j-1})$ we construct the gradient estimate by
\[
\hat m_{n,L}(s, \bar x) = \hat M_{1,n}(s, \bar x) + \sum_{k=2}^{N_n} \hat\Delta_{k,n,L}(s, \bar x),
\]
using
\[
\hat M_{k,n}(s, \bar x) := \left( \sum_{j=-M}^{n} \frac{X_j}{\langle s, X_j \rangle} \cdot 1_{A_k(\bar x)}(\bar X_j) \right) \Big/ \left( \sum_{j=-M}^{n} 1_{A_k(\bar x)}(\bar X_j) \right), \tag{6.2.1}
\]
\[
\hat\Delta_{k,n,L}(s, \bar x) := T_{L2^{-k}}\big( \hat M_{k,n}(s, \bar x) - \hat M_{k-1,n}(s, \bar x) \big).
\]
Here, $M := T - d \in \mathbb{N}_0$ is the length of the training period of the algorithm before the first estimate is produced. $T_{L2^{-k}}$ denotes the truncation operator, defined for $z = (z_1, ..., z_m) \in \mathbb{R}^m$ by $T_{L2^{-k}} z = (w_1, ..., w_m)$ with $w_i := \mathrm{sgn}\, z_i \cdot \min\{|z_i|, L2^{-k}\}$.

3. Having obtained an estimate of the gradient of the target function, we apply the classical projection algorithm to estimate the maximum in (6.1.2) (Kushner and Clark, 1978, Sec. 5.3, also used in different form by Walk and Yakowitz, 1999): From the previous estimate $Z_{n-1}^L(\bar x)$ we calculate an updated estimate by
\[
Z_n^L(\bar x) := \Pi\big( Z_{n-1}^L(\bar x) + \alpha_n\, \hat m_{n,L}(Z_{n-1}^L(\bar x), \bar x) \big). \tag{6.2.2}
\]
Here, for $x \in \mathbb{R}^m$, $\Pi(x)$ denotes the best approximating (in the Euclidean norm) element of $x$ in the simplex $S$, i.e. the projection of $x$ on $S$. To start the iteration at time $n = 0$, we use an arbitrary starting estimate $Z_{-1}^L(\bar x) \in S$.
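The steps above can be rendered in code. This is an illustrative sketch, not the thesis' implementation: the gradient estimate is abstracted as a callable, `truncate` is the operator $T_c$, and `project_simplex` computes the Euclidean projection $\Pi$ onto $S$ via the standard sorting-based algorithm:

```python
import math

def truncate(z, c):
    # T_c z: componentwise w_i = sgn(z_i) * min(|z_i|, c)
    return [math.copysign(min(abs(zi), c), zi) for zi in z]

def project_simplex(x):
    # Euclidean projection of x onto S = {s : sum_j s_j = 1, s_j >= 0}
    u = sorted(x, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:   # the condition holds exactly for a prefix of indices j
            theta = t
    return [max(xi - theta, 0.0) for xi in x]

def update(z_prev, grad_est, alpha):
    # one step of (6.2.2): Z_n = Pi(Z_{n-1} + alpha_n * m_hat(Z_{n-1}, x_bar))
    g = grad_est(z_prev)
    return project_simplex([zi + alpha * gi for zi, gi in zip(z_prev, g)])
```

For example, `project_simplex([2.0, 0.0])` gives `[1.0, 0.0]`, and every `update` keeps the iterate inside the simplex, as required of the sequence $Z_n^L(\bar x)$.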

Note that we assume L to be known. At a later stage the algorithm is modified


so as to comprise an adaptive choice of the market parameter L, which then
allows estimation without knowledge of the precise value of L.
The following lemma featuring the basic convergence properties of the estimate
will be crucial to the main results of this chapter. It shows that our estimates
approach the set KT (x̄) of Kuhn-Tucker points of (6.1.2), i.e., the collection of
values of log-optimal portfolio-selection functions at x̄.

Lemma 6.2.1. Let $\rho\big( Z_n^L(\bar x), KT(\bar x) \big) := \inf_{y\in KT(\bar x)} \|Z_n^L(\bar x) - y\|$ denote the Euclidean distance of $Z_n^L(\bar x)$ from the set $KT(\bar x)$. Then, under the assumptions

1. V1 and V2,

2. $\alpha_n \to 0$ ($n\to\infty$) and $\sum_{n=0}^{\infty} \alpha_n = \infty$,

we have that for $P_{\bar X_0}$-a.a. $\bar x$:
\[
\lim_{n\to\infty} \rho\big( Z_n^L(\bar x), KT(\bar x) \big) = 0 \quad P\text{-a.s.}
\]

To formulate this result more neatly, let $Z_n^{L*}(\bar x)$ denote the best approximating (in the Euclidean metric) element of $Z_n^L(\bar x)$ in $KT(\bar x)$ (observe that $KT(\bar x)$ is compact). Note that $Z_n^{L*}(\bar x)$ is a log-optimal portfolio selection function as $\bar x$ varies. Then Lemma 6.2.1 can be rephrased more explicitly as

Theorem 6.2.2. Under the assumptions of Lemma 6.2.1 one has

1. Pointwise strong consistency for $P_{\bar X_0}$-a.a. $\bar x$:
\[
|Z_n^L(\bar x) - Z_n^{L*}(\bar x)| \longrightarrow 0 \quad (n\to\infty) \quad P\text{-a.s.}
\]
and
\[
R(Z_n^L(\bar x), \bar x) \longrightarrow R^*(\bar x) \quad (n\to\infty) \quad P\text{-a.s.} \tag{6.2.3}
\]
2. Strong $L_1$-consistency:
\[
\int |Z_n^L(\bar x) - Z_n^{L*}(\bar x)|\, P_{\bar X_0}(d\bar x) \longrightarrow 0 \quad (n\to\infty) \quad P\text{-a.s.}
\]
and
\[
\int R(Z_n^L(\bar x), \bar x)\, P_{\bar X_0}(d\bar x) \longrightarrow R_{\max} \quad (n\to\infty) \quad P\text{-a.s.}, \tag{6.2.4}
\]
hence also in $L_r(P)$ for any $r \in \mathbb{N}$.

The limit relations (6.2.3) and (6.2.4) are the central results for the proposed
estimation procedure. They demonstrate that in the long run ZnL (x̄) almost
surely achieves the optimal expected growth of wealth among all strategies
based on portfolio selection functions of the d-past.
Remark concerning Lemma 6.2.1 and Theorem 6.2.2. The limit relations in Lemma 6.2.1 and Theorem 6.2.2, part 1, are true even in the stronger sense that a fixed exceptional null set of $\omega \in \Omega$ and a fixed exceptional null set of $\bar x \in [a,b]^{dm}$ exist, outside which for all $\omega$ and $\bar x$
\[
\rho\big( Z_n^L(\bar x), KT(\bar x) \big) \longrightarrow 0, \qquad
|Z_n^L(\bar x) - Z_n^{L*}(\bar x)| \longrightarrow 0 \qquad \text{and} \qquad
R(Z_n^L(\bar x), \bar x) \longrightarrow R^*(\bar x)
\]
as $n\to\infty$ (cf. proof of Theorem 6.2.2).


Until now, we have merely considered the problem how to estimate a log-optimal
portfolio selection function. Of course this is not the primary task of the in-
vestor. He would like to actually use a log-optimal portfolio selection function
or – in case such a function is unknown – the estimates ZnL to rebalance his
investment porfolio. At time i ∈ IN0 , ZiL (x̄) is the most recent estimate of a
log-optimal portfolio selection function b(x̄) in (6.1.1). The investor therefore
takes ZiL(X̄i+1 ) = ZiL(Xi+1−d , ..., Xi) as the investment scheme to be used at
time i ∈ IN0 . The accumulated investment returns up to time n ∈ IN are
n−1
Y
Rn := < ZiL(X̄i+1 ), Xi+1 > .
i=0

The following corollary shows that, asymptotically, the investment strategy $Z_i^L(\bar X_{i+1})$ is superior to any other strategy using a portfolio selection function of the last $d$ market periods (pathwise competitive optimality).

Corollary 6.2.3. Suppose the support of the distribution $P_{X_0}$ is not confined to a hyperplane in $\mathbb{R}^m$ containing the diagonal $\{(d, ..., d)^T \in \mathbb{R}^m \mid d \in \mathbb{R}\}$. For any measurable portfolio selection function $f : \mathbb{R}^{dm}_+ \longrightarrow S$ with accumulated returns
\[
V_n := \prod_{i=0}^{n-1} \langle f(\bar X_{i+1}), X_{i+1} \rangle
\]
we have
\[
\limsup_{n\to\infty} \frac{1}{n} \log \frac{V_n}{R_n} \le 0 \quad P\text{-a.s.}
\]

Lemma 6.2.1 and Theorem 6.2.2 are statements corresponding to Theorem 1 in


Walk and Yakowitz (1999), Corollary 6.2.3 is a generalisation of Corollary 1 in
Walk (2000). However, the statements are valid under fundamentally different
(in fact, considerably weaker) assumptions.
As already mentioned, in practical applications the exact value of $L$ is not disclosed to the investor. On the other hand, one can assume that, as the share prices take on rational, i.e. countably many, values only, so does the return process $\{X_i\}_{i=-T}^{\infty}$ as a process of price ratios. In this situation an adaptive choice of $L$ can be carried out in the following way: Having fixed a sequence $\gamma_n \in \mathbb{N}$, $\gamma_n \longrightarrow \infty$, for the $n$th investment step a random variable
\[
L_n := \arg\max_{K\in\{1,...,\gamma_n\}} \frac{1}{n+M+1} \sum_{j=-M}^{n} \log \langle Z_n^K(\bar X_j), X_j \rangle \tag{6.2.5}
\]
is defined, and the estimate of a log-optimal portfolio selection function $b(\bar x)$ is
\[
Z_n(\bar x) := Z_n^{L_n}(\bar x).
\]
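The selection rule (6.2.5) is a one-line maximization once the candidate estimators are available. A minimal sketch (the candidate family and the toy data are illustrative assumptions; `estimates[K]` plays the role of $Z_n^K$):

```python
import math

def select_L(estimates, past_windows, returns):
    # L_n := argmax over K of the empirical mean log-return, cf. (6.2.5)
    def mean_log_return(Z):
        total = sum(math.log(sum(p * x for p, x in zip(Z(w), r)))
                    for w, r in zip(past_windows, returns))
        return total / len(returns)
    return max(estimates, key=lambda K: mean_log_return(estimates[K]))

# toy data: asset 2 outperforms, so the candidate holding asset 2 is selected
windows = [((1.0, 1.0),), ((1.0, 1.0),)]
rets = [(0.9, 1.2), (0.95, 1.1)]
estimates = {1: lambda w: (1.0, 0.0), 2: lambda w: (0.0, 1.0)}
```

Note that, as in (6.2.5), the rule compares candidates only through the data-driven criterion, so no knowledge of the true $L$ is needed.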

For this procedure, we have

Theorem 6.2.4. Assume the distribution $P_{X_0}$ is supported on a denumerable set and the support of the distribution is not confined to a hyperplane in $\mathbb{R}^m$ containing the diagonal $\{(d, ..., d)^T \in \mathbb{R}^m \mid d \in \mathbb{R}\}$. Then Lemma 6.2.1, Theorem 6.2.2 and Corollary 6.2.3 remain valid if $Z_n^L$ is replaced by $Z_n$.

For the regression estimator of Yakowitz, Györfi et al. (1999) it is not yet known
whether there exists an adaptive rule for the choice of the Lipschitz constant L
generating a procedure that is strongly consistent for arbitrary stationary er-
godic processes with Lipschitz continuous regression function. Theorem 6.2.4 is
remarkable because it asserts that for the application of the regression estimate
to the portfolio optimization problem such adaptation can be achieved.
We finish this section with two remarks about extensions of the stated results.
Remark concerning (6.2.1). Lemma 6.2.1, the first part of Theorem 6.2.2 and Theorem 6.2.4 still hold if we use kernel estimates
\[
\hat M_{k,n}(s, \bar x) := \left( \sum_{j=-M}^{n} \frac{X_j}{\langle s, X_j \rangle}\, K\!\left( \frac{\bar X_j - \bar x}{h_k} \right) \right) \Big/ \left( \sum_{j=-M}^{n} K\!\left( \frac{\bar X_j - \bar x}{h_k} \right) \right)
\]
instead of the partitioning estimates in (6.2.1). For this, we choose a continuous kernel function $K : \mathbb{R}^{dm} \longrightarrow \mathbb{R}_+$ having compact support, $K(0) > 0$, and bandwidths $h_k := 2^{-k-2}$ (Yakowitz, Györfi et al., 1999, Sec. 3). However, as we shall see, the argument in the proof of the second part of Theorem 6.2.2 breaks down unless the distribution of $\bar X_0$ is supported on a denumerable set.
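The kernel variant above is a Nadaraya-Watson-style ratio, which can be sketched as follows. The box kernel used here is a hypothetical stand-in (it is compactly supported with $K(0) > 0$, though not continuous, so it serves only to keep the arithmetic visible):

```python
def kernel_estimate(data, x_bar, h, K):
    # data: list of (past_window, response_vector) pairs
    # returns sum_j Y_j K((Xbar_j - x)/h) / sum_j K((Xbar_j - x)/h), or None
    m = len(data[0][1])
    num, den = [0.0] * m, 0.0
    for w, y in data:
        wgt = K([(wi - xi) / h for wi, xi in zip(w, x_bar)])
        den += wgt
        num = [ni + wgt * yi for ni, yi in zip(num, y)]
    return [ni / den for ni in num] if den > 0 else None

# box kernel on [-1, 1]^{dm}
box = lambda u: 1.0 if max(abs(ui) for ui in u) <= 1.0 else 0.0
```

Observations whose d-past falls outside the bandwidth window contribute weight zero, mirroring the role of the cell $A_k(\bar x)$ in the partitioning estimate.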
Remark concerning (6.2.5). (6.2.5) can be seen as an application of the principle of empirical risk minimization. By (6.2.2) random classes of functions $Z_n^L : [a,b]^{dm} \to S$ are defined, parametrized by admissible step widths $\alpha_0, ..., \alpha_n$ and potential Lipschitz constants $L$. An estimator is picked out minimizing the empirical risk, here the negative empirical mean return. This can also be used to choose suitable step widths. In fact, the same arguments as used in the proof of Theorem 6.2.4 show: If $Z_n = Z_n^{(k,\alpha_0,...,\alpha_n)}$ is an estimator constructed with step widths $\alpha_0, ..., \alpha_n$ and Lipschitz parameter $k$ such that
\[
\frac{1}{n+M+1} \sum_{j=-M}^{n} \log \langle Z_n^{(k,\alpha_0,...,\alpha_n)}(\bar X_j), X_j \rangle
\ge \frac{1}{n+M+1} \sum_{j=-M}^{n} \log \langle Z_n^{(L,1,1,1/2,...,1/n)}(\bar X_j), X_j \rangle \tag{6.2.6}
\]
(for sufficiently large $n$), then Theorem 6.2.4 remains valid for this $Z_n$.

6.3 Checking the properties of the estimation algorithm
We now move on to the proof of the statements of the preceding section.

6.3.1 Proof of the convergence Lemma 6.2.1

The algorithm (6.2.2) is an application of a classical projection algorithm for problem (6.1.2), using an estimate of the gradient of the target function,
\[
m(s, \bar x) = \frac{\partial}{\partial s}\, E\big( \log \langle s, X_0 \rangle \mid \bar X_0 = \bar x \big) = E\left( \frac{X_0}{\langle s, X_0 \rangle} \,\Big|\, \bar X_0 = \bar x \right),
\]
for $s \in S$. The gradient estimate is obtained via a partitioning method in (6.2.1). For this reason, before we can turn to the proof of the crucial convergence Lemma 6.2.1, we have to formulate some preliminary results on consistency properties of the gradient estimate.
The data the statistician can access at time $n \in \mathbb{N}_0$,
\[
\left\{ \bar X_i := (X_{i-d}, ..., X_{i-1}),\quad Y_i^{(s)} := \frac{X_i}{\langle s, X_i \rangle} \right\}_{i=-M}^{n},
\]
are drawn from a stationary and ergodic process. Indeed, referring to Stout (1974), Theorem 3.5.8, as $\{X_i\}_{i=-d-M}^{\infty}$ is stationary and ergodic, so is the stochastic process $\{(X_{i-d}, ..., X_i)\}_{i=-M}^{\infty}$. This follows from the cited theorem, observing that $\langle s, X_i \rangle > 0$ for all $s \in S$, so that $f_s : \mathbb{R}^{m(d+1)}_+ \longrightarrow \mathbb{R}^{m(d+1)}_+ : (x_1, ..., x_{1+d}) \longmapsto (x_1, ..., x_d,\ x_{1+d}/\langle s, x_{1+d} \rangle)$ is a well-defined measurable mapping.
Moreover, $Y_i^{(s)}$ is bounded and Lipschitz continuous in $s$ because of
\[
\left| \frac{X_i}{\langle s, X_i \rangle} \right|
= \frac{\sqrt{\sum_{j=1}^{m} X_{i,j}^2}}{\sum_{j=1}^{m} s_j X_{i,j}}
\le \frac{\sqrt{m b^2}}{a \sum_{j=1}^{m} s_j}
= \frac{\sqrt{m}\, b}{a} \tag{6.3.1}
\]
and (applying the Cauchy-Schwarz inequality)
\[
\begin{aligned}
\left| \frac{X_i}{\langle s, X_i \rangle} - \frac{X_i}{\langle t, X_i \rangle} \right|
&\le \left| \frac{1}{\langle s, X_i \rangle} - \frac{1}{\langle t, X_i \rangle} \right| \cdot |X_i|
= \frac{|\langle t - s, X_i \rangle|}{|\langle s, X_i \rangle \langle t, X_i \rangle|} \cdot |X_i| \\
&\le \frac{|X_i|^2}{|\langle s, X_i \rangle \langle t, X_i \rangle|} \cdot |s - t|
\le \frac{m b^2}{\sum_{j=1}^{m} s_j X_{i,j} \cdot \sum_{j=1}^{m} t_j X_{i,j}} \cdot |s - t| \\
&\le \frac{m b^2}{\sum_{j=1}^{m} s_j a \cdot \sum_{j=1}^{m} t_j a} \cdot |s - t|
= \frac{m b^2}{a^2} \cdot |s - t| \qquad (s, t \in S).
\end{aligned}
\tag{6.3.2}
\]
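Bounds (6.3.1) and (6.3.2) are easy to spot-check by simulation. A small Monte Carlo sketch (the values of $a$, $b$, $m$ and the sampling scheme are arbitrary illustrative choices):

```python
import math
import random

def simplex_point(rng, m):
    # a point of S via sorted uniform cuts (stick breaking)
    cuts = sorted(rng.random() for _ in range(m - 1))
    return [q - p for p, q in zip([0.0] + cuts, cuts + [1.0])]

def check_bounds(a=0.5, b=2.0, m=4, trials=500, seed=1):
    rng = random.Random(seed)
    norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
    for _ in range(trials):
        x = [rng.uniform(a, b) for _ in range(m)]
        s, t = simplex_point(rng, m), simplex_point(rng, m)
        ys = [xi / sum(sj * xj for sj, xj in zip(s, x)) for xi in x]
        yt = [xi / sum(tj * xj for tj, xj in zip(t, x)) for xi in x]
        if norm(ys) > math.sqrt(m) * b / a + 1e-9:           # bound (6.3.1)
            return False
        diff = [p - q for p, q in zip(ys, yt)]
        st = [p - q for p, q in zip(s, t)]
        if norm(diff) > (m * b * b / (a * a)) * norm(st) + 1e-9:  # bound (6.3.2)
            return False
    return True
```

Every sampled pair of simplex points satisfies both inequalities, as the derivation guarantees.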

Yakowitz, Györfi et al. (1999) propose a strongly consistent estimator for a Lipschitz continuous regression function based on stationary ergodic data. The gradient $m(s, \bar x)$ is the regression function of $Y_i^{(s)}$ on $\bar X_i$, so that we may use this estimator to obtain a gradient estimate. To this end, let $\mathcal{P}_k$ be a partition of $\mathbb{R}^{dm}_+$ into cubes of volume $(2^{-k-2})^{dm}$. $A_k(\bar x)$ denotes the element of $\mathcal{P}_k$ in which $\bar x \in \mathbb{R}^{dm}_+$ comes to lie. Then $m(s, \bar x)$ is estimated by
\[
\hat m_{n,L}(s, \bar x) := \hat M_{1,n}(s, \bar x) + \sum_{k=2}^{N_n} \hat\Delta_{k,n,L}(s, \bar x)
\]
with
\[
\hat M_{k,n}(s, \bar x) := \left( \sum_{j=-M}^{n} \frac{X_j}{\langle s, X_j \rangle}\, 1_{A_k(\bar x)}(\bar X_j) \right) \Big/ \left( \sum_{j=-M}^{n} 1_{A_k(\bar x)}(\bar X_j) \right), \tag{6.3.3}
\]
\[
\hat\Delta_{k,n,L}(s, \bar x) := T_{L2^{-k}}\big( \hat M_{k,n}(s, \bar x) - \hat M_{k-1,n}(s, \bar x) \big) \tag{6.3.4}
\]
and some fixed sequence $N_n \in \mathbb{N}$ satisfying $\lim_{n\to\infty} N_n = \infty$.
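A minimal sketch of the partitioning components (6.3.3)-(6.3.4). The dyadic cell index plays the role of $A_k(\bar x)$, and truncation at $L2^{-k}$ gives the increment estimate; all data and parameter values here are illustrative:

```python
import math

def cell(x_bar, k):
    # index of the cube of side length 2^{-k-2} containing x_bar, i.e. A_k(x_bar)
    h = 2.0 ** (-k - 2)
    return tuple(math.floor(xi / h) for xi in x_bar)

def M_hat(data, s, x_bar, k):
    # partitioning estimate (6.3.3): cell average of X_j / <s, X_j>
    num, cnt, c = None, 0, cell(x_bar, k)
    for past, x in data:                    # data: list of (X_bar_j, X_j)
        if cell(past, k) == c:
            y = [xi / sum(sj * xj for sj, xj in zip(s, x)) for xi in x]
            num = y if num is None else [u + v for u, v in zip(num, y)]
            cnt += 1
    return [v / cnt for v in num] if cnt else None

def delta_hat(data, s, x_bar, k, L):
    # truncated increment (6.3.4): T_{L 2^{-k}}(M_hat_k - M_hat_{k-1})
    mk, mk1 = M_hat(data, s, x_bar, k), M_hat(data, s, x_bar, k - 1)
    c = L * 2.0 ** (-k)
    return [math.copysign(min(abs(u - v), c), u - v) for u, v in zip(mk, mk1)]
```

As $k$ grows, the cells shrink, so each refinement averages over fewer but more local observations, which is exactly what the telescoping below exploits.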


The estimator is motivated by the telescopic expansion
\[
m(s, \bar x) = \lim_{k\to\infty} M_k(s, \bar x) = M_1(s, \bar x) + \sum_{k=2}^{\infty} \Delta_k(s, \bar x) \tag{6.3.5}
\]
of the limit relation
\[
M_k(s, \bar x) := E\left( \frac{X_0}{\langle s, X_0 \rangle} \,\Big|\, \bar X_0 \in A_k(\bar x) \right) \longrightarrow m(s, \bar x)
\]
for $P_{\bar X_0}$-a.a. $\bar x$, where $\Delta_k(s, \bar x) := M_k(s, \bar x) - M_{k-1}(s, \bar x)$ (Yakowitz, Györfi et al., 1999, eq. (4)). In addition, a truncated version of the expansion is defined by
\[
m_L(s, \bar x) := M_1(s, \bar x) + \sum_{k=2}^{\infty} \Delta_{k,L}(s, \bar x)
\]
with $\Delta_{k,L}(s, \bar x) := T_{L2^{-k}} \Delta_k(s, \bar x)$.



The convergence of the components (6.3.3) and (6.3.4) in the definition of the estimator to the corresponding components in expansion (6.3.5) is given by the following

Lemma 6.3.1. Under the assumption V1 one has for $P_{\bar X_0}$-a.a. $\bar x$ and any fixed $s \in S$:
\[
1. \quad \big( \hat M_{1,n}(s, \bar x) - M_1(s, \bar x) \big) \longrightarrow 0 \quad (n\to\infty) \quad P\text{-a.s.},
\]
\[
2. \quad \big( \hat\Delta_{k,n,L}(s, \bar x) - \Delta_{k,L}(s, \bar x) \big) \longrightarrow 0 \quad (n\to\infty) \quad P\text{-a.s.}
\]

Proof. Straightforward application of the ergodic theorem (Stout, 1974, Theorem 3.5.7; Yakowitz, Györfi et al., 1999, eq. (16)) yields

\[
\hat M_{k,n}(s,\bar x) \longrightarrow M_k(s,\bar x) \quad (n\to\infty)\ \ P\text{-a.s.},
\]

in particular the first part of the lemma. Since the truncation operator is itself Lipschitz continuous with Lipschitz constant 1, the second part of the lemma is obtained by

\[
\begin{aligned}
\big|\hat\Delta_{k,n,L}(s,\bar x) - \Delta_{k,L}(s,\bar x)\big|
&\le \big|T_{L2^{-k}}\big(\hat M_{k,n}(s,\bar x) - \hat M_{k-1,n}(s,\bar x)\big) - T_{L2^{-k}}\big(M_k(s,\bar x) - M_{k-1}(s,\bar x)\big)\big| \\
&\le \big|\hat M_{k,n}(s,\bar x) - M_k(s,\bar x)\big| + \big|M_{k-1}(s,\bar x) - \hat M_{k-1,n}(s,\bar x)\big| \\
&\longrightarrow 0 \quad P\text{-a.s.}\ (n\to\infty). \qquad \Box
\end{aligned}
\]
In Yakowitz, Györfi et al. (1999) a lemma analogous to the above one is used to
obtain pointwise strong consistency of m̂n,L(s, x̄). However, the proof of Lemma
6.2.1 requires convergence to hold uniformly in S, which will be derived from
the following lemma.

Lemma 6.3.2. Let S ⊆ IRd be some compact set, K > 0 and (fn )n∈IN a class
of functions fn : S −→ IRd with |fn (s) − fn (t)| ≤ K|s − t| for all s, t ∈ S.
Then limn→∞ fn (s) = 0 for all s ∈ S implies that limn→∞ sups∈S |fn (s)| = 0.

Proof. Let δ > 0 be arbitrary but fixed. Choose a finite δ-net N in S. Then for all s ∈ S there exists some t_s ∈ N with |s − t_s| ≤ δ. This yields

\[
|f_n(s)| \le |f_n(s) - f_n(t_s)| + |f_n(t_s)| \le K|s - t_s| + |f_n(t_s)| \le K\cdot\delta + |f_n(t_s)|
\]

and

\[
\sup_{s\in S} |f_n(s)| \le K\cdot\delta + \sup_{t\in N} |f_n(t)|.
\]

Since N is finite, one has

\[
\limsup_{n\to\infty}\,\sup_{s\in S} |f_n(s)| \le K\cdot\delta + \limsup_{n\to\infty}\,\sup_{t\in N} |f_n(t)| = K\cdot\delta + 0 = K\cdot\delta.
\]

The assertion follows from δ being arbitrary. □
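Lemma 6.3.2 is the usual equicontinuity route from pointwise to uniform convergence. A small numerical illustration (the family f_n(s) = sin(ns)/n on S = [0, 1] is our own toy example, Lipschitz with the common constant K = 1 since |f_n'(s)| = |cos(ns)| ≤ 1):

```python
import numpy as np

# f_n(s) = sin(n s)/n converges to 0 pointwise, and every f_n shares the
# Lipschitz constant K = 1, so Lemma 6.3.2 predicts uniform convergence;
# indeed sup_s |f_n(s)| = 1/n.
def sup_abs(n):
    grid = np.linspace(0.0, 1.0, 10_001)
    return float(np.abs(np.sin(n * grid) / n).max())

sups = [sup_abs(n) for n in (10, 100, 1000)]
# the suprema shrink like 1/n
```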


There are two consequences of Lemma 6.3.2 which we are going to need.

Consequence 1: For P_{X̄_0}-a.a. x̄ we have

\[
\lim_{n\to\infty}\sup_{s\in S}\big|\hat M_{1,n}(s,\bar x) - M_1(s,\bar x)\big| = 0 \quad P\text{-a.s.} \tag{6.3.6}
\]

This is a consequence of Lemma 6.3.2 with f_n(s) := M̂_{1,n}(s, x̄) − M_1(s, x̄). According to Lemma 6.3.1, lim_{n→∞} f_n(s) = 0 P-a.s. and |f_n(s) − f_n(t)| ≤ |M̂_{1,n}(s, x̄) − M̂_{1,n}(t, x̄)| + |M_1(s, x̄) − M_1(t, x̄)| hold for any s ∈ S and for P_{X̄_0}-a.a. x̄. To bound the terms in the latter expression, one uses

\[
\begin{aligned}
\big|\hat M_{k,n}(s,\bar x) - \hat M_{k,n}(t,\bar x)\big|
&\le \left(\sum_{j=-M}^{n} \left|\frac{X_j}{\langle s, X_j\rangle} - \frac{X_j}{\langle t, X_j\rangle}\right| 1_{A_k(\bar x)}(\bar X_j)\right)\Bigg/\left(\sum_{j=-M}^{n} 1_{A_k(\bar x)}(\bar X_j)\right) \\
&\le \max_{-M\le j\le n}\left|\frac{X_j}{\langle s, X_j\rangle} - \frac{X_j}{\langle t, X_j\rangle}\right| \le \frac{mb^2}{a^2}\,|s-t|
\end{aligned} \tag{6.3.7}
\]

(the latter inequality from (6.3.2)) and

\[
\begin{aligned}
\big|M_k(s,\bar x) - M_k(t,\bar x)\big|
&\le E\!\left[\left|\frac{X_0}{\langle s, X_0\rangle} - \frac{X_0}{\langle t, X_0\rangle}\right|\,\middle|\,\bar X_0 \in A_k(\bar x)\right] \\
&\le E\!\left[\frac{mb^2}{a^2}\,|s-t|\,\middle|\,\bar X_0 \in A_k(\bar x)\right] = \frac{mb^2}{a^2}\,|s-t|.
\end{aligned} \tag{6.3.8}
\]

Altogether this yields

\[
|f_n(s) - f_n(t)| \le \frac{mb^2}{a^2}\,|s-t| + \frac{mb^2}{a^2}\,|s-t| = 2\,\frac{mb^2}{a^2}\,|s-t|
\]

for all s, t ∈ S, the situation as required in Lemma 6.3.2, and (6.3.6) is valid.

Consequence 2: Let R ∈ {2, 3, ...} be fixed. Then for P_{X̄_0}-a.a. x̄

\[
\lim_{n\to\infty}\sup_{s\in S}\sum_{k=2}^{R}\big|\hat\Delta_{k,n,L}(s,\bar x) - \Delta_{k,L}(s,\bar x)\big| = 0 \quad P\text{-a.s.} \tag{6.3.9}
\]

holds.

Here, put f_n(s) := Σ_{k=2}^R |∆̂_{k,n,L}(s, x̄) − ∆_{k,L}(s, x̄)|. For any fixed s ∈ S and for P_{X̄_0}-a.a. x̄ one has lim_{n→∞} (∆̂_{k,n,L}(s, x̄) − ∆_{k,L}(s, x̄)) = 0 P-a.s. according to Lemma 6.3.1 and hence lim_{n→∞} f_n(s) = 0 P-a.s. for all s ∈ S. Moreover,

\[
|f_n(s) - f_n(t)| \le \sum_{k=2}^{R}\big|\hat\Delta_{k,n,L}(s,\bar x) - \hat\Delta_{k,n,L}(t,\bar x)\big| + \sum_{k=2}^{R}\big|\Delta_{k,L}(s,\bar x) - \Delta_{k,L}(t,\bar x)\big|.
\]

The first term can be bounded using

\[
\begin{aligned}
\big|\hat\Delta_{k,n,L}(s,\bar x) - \hat\Delta_{k,n,L}(t,\bar x)\big|
&= \big|T_{L2^{-k}}\big(\hat M_{k,n}(s,\bar x) - \hat M_{k-1,n}(s,\bar x)\big) - T_{L2^{-k}}\big(\hat M_{k,n}(t,\bar x) - \hat M_{k-1,n}(t,\bar x)\big)\big| \\
&\le \big|\hat M_{k,n}(s,\bar x) - \hat M_{k,n}(t,\bar x)\big| + \big|\hat M_{k-1,n}(s,\bar x) - \hat M_{k-1,n}(t,\bar x)\big| \\
&\le 2\,\frac{mb^2}{a^2}\,|s-t|
\end{aligned}
\]

(with (6.3.7)), and we obtain

\[
|f_n(s) - f_n(t)| \le 2(R-1)\frac{mb^2}{a^2}\,|s-t| + \sum_{k=2}^{R}\big|M_k(s,\bar x) - M_k(t,\bar x)\big| + \sum_{k=2}^{R}\big|M_{k-1}(s,\bar x) - M_{k-1}(t,\bar x)\big|
\]

for all s, t ∈ S. Appealing to inequality (6.3.8) this becomes

\[
|f_n(s) - f_n(t)| \le 4(R-1)\frac{mb^2}{a^2}\,|s-t|,
\]

and the requirements of Lemma 6.3.2 are met. The assertion (6.3.9) follows.
Finally we establish the desired strong consistency of the gradient estimate m̂_{n,L}(s, x̄) uniformly in S.

Lemma 6.3.3. Under the assumption V1,

\[
\lim_{n\to\infty}\sup_{s\in S}\big|\hat m_{n,L}(s,\bar x) - m_L(s,\bar x)\big| = 0 \quad P\text{-a.s.}
\]

holds for P_{X̄_0}-a.a. x̄.

Proof. In view of (6.3.6) it suffices to show that for P_{X̄_0}-a.a. x̄

\[
\lim_{n\to\infty}\sup_{s\in S}\left|\sum_{k=2}^{N_n}\hat\Delta_{k,n,L}(s,\bar x) - \sum_{k=2}^{\infty}\Delta_{k,L}(s,\bar x)\right| = 0 \quad P\text{-a.s.}
\]

Let R ∈ {2, 3, ...} be arbitrary. For sufficiently large n we have N_n > R and as in Yakowitz, Györfi et al. (1999), proof of Theorem 1, we obtain

\[
\left|\sum_{k=2}^{N_n}\hat\Delta_{k,n,L}(s,\bar x) - \sum_{k=2}^{\infty}\Delta_{k,L}(s,\bar x)\right|
\le \sum_{k=2}^{R}\big|\hat\Delta_{k,n,L}(s,\bar x) - \Delta_{k,L}(s,\bar x)\big| + \sum_{k=R+1}^{N_n}\big|\hat\Delta_{k,n,L}(s,\bar x)\big| + \sum_{k=R+1}^{\infty}\big|\Delta_{k,L}(s,\bar x)\big|.
\]

The last two terms are bounded by

\[
\sum_{k=R+1}^{N_n} L\cdot 2^{-k} + \sum_{k=R+1}^{\infty} L\cdot 2^{-k} \le 2L\sum_{k=R+1}^{\infty} 2^{-k} = 2^{-(R-1)}\cdot L.
\]

Hence

\[
\sup_{s\in S}\left|\sum_{k=2}^{N_n}\hat\Delta_{k,n,L}(s,\bar x) - \sum_{k=2}^{\infty}\Delta_{k,L}(s,\bar x)\right|
\le \sup_{s\in S}\sum_{k=2}^{R}\big|\hat\Delta_{k,n,L}(s,\bar x) - \Delta_{k,L}(s,\bar x)\big| + 2^{-(R-1)}\cdot L.
\]

Using (6.3.9) we derive

\[
\limsup_{n\to\infty}\sup_{s\in S}\left|\sum_{k=2}^{N_n}\hat\Delta_{k,n,L}(s,\bar x) - \sum_{k=2}^{\infty}\Delta_{k,L}(s,\bar x)\right| \le 2^{-(R-1)}\cdot L
\]

P-a.s., and the assertion follows letting R → ∞. □
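The tail estimate in the proof above is elementary geometric-series arithmetic: since Σ_{k=R+1}^∞ 2^{−k} = 2^{−R}, the two tails together contribute at most 2L·2^{−R} = 2^{−(R−1)}·L. A quick check (R and L are arbitrary sample values of ours):

```python
# Tail bound: 2L * sum_{k=R+1}^inf 2^{-k} = 2L * 2^{-R} = 2^{-(R-1)} * L.
R, L = 7, 100.0
tail = 2 * L * sum(2.0 ** (-k) for k in range(R + 1, 60))  # far tail cut at k = 59
# tail agrees with 2^{-(R-1)} * L up to the ~2^{-59} cutoff error
```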


Having obtained these preliminary results, we can now move on to proving Lemma 6.2.1. First, recall that the projection algorithm (6.2.2) is given by the recurrence relation (n ∈ IN_0)

\[
Z_{-1}^L(\bar x) \in S \ \text{arbitrary and}\ \ Z_n^L(\bar x) := \Pi\big(Z_{n-1}^L(\bar x) + \alpha_n\,\hat m_{n,L}(Z_{n-1}^L(\bar x), \bar x)\big), \tag{6.3.10}
\]

where

\[
\hat m_{n,L}(s,\bar x) = \hat M_{1,n}(s,\bar x) + \sum_{k=2}^{N_n}\hat\Delta_{k,n,L}(s,\bar x).
\]

Posing this as

\[
Z_n^L(\bar x) = \Pi\big(Z_{n-1}^L(\bar x) + \alpha_n\,m(Z_{n-1}^L, \bar x) + \alpha_n\,\beta_n(\bar x)\big)
\]

with

\[
\beta_n(\bar x) := \hat m_{n,L}(Z_{n-1}^L(\bar x), \bar x) - m(Z_{n-1}^L(\bar x), \bar x), \tag{6.3.11}
\]

the projection algorithm (6.3.10) is a special case of the projection algorithm

\[
W_n = \Pi\big(W_{n-1} + \alpha_n(m(W_{n-1}) + \xi_n + \beta_n)\big)
\]

in Kushner and Clark (1978, eq. 5.3.1) for ξ_n := 0. Their Theorem 5.3.1 adapted to the case ξ_n := 0 reads

Lemma 6.3.4. Under the assumptions

1. m(·, ·) is continuous,

2. α_n > 0 with α_n → 0 for n → ∞ and Σ_{n=0}^∞ α_n = ∞,

3. there exists a c ≥ 0 with |β_n(x̄)| ≤ c < ∞ for all n ∈ IN_0,

4. β_n(x̄) → 0 P-a.s. for n → ∞

the projection algorithm converges,

\[
\lim_{n\to\infty} \rho\big(Z_n^L(\bar x), K_T(\bar x)\big) = 0 \quad P\text{-a.s.}
\]
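To illustrate what a recursion of the form (6.3.10) does, the sketch below iterates W_n = Π(W_{n−1} + α_n g(W_{n−1})) on the probability simplex, with Π realized as the Euclidean projection computed by the standard sorting method; the gradient field g and the step sizes α_n = 1/(n+1) are stand-ins of our own, not the estimated gradient m̂_{n,L}.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {s : s_j >= 0, sum_j s_j = 1}, via the standard sorting method."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def projected_iteration(grad, s0, steps=500):
    """Projection algorithm W_n = Pi(W_{n-1} + alpha_n * grad(W_{n-1}))
    with step sizes alpha_n = 1/(n+1), so that sum_n alpha_n = infinity."""
    s = project_simplex(np.asarray(s0, dtype=float))
    for n in range(steps):
        s = project_simplex(s + grad(s) / (n + 1))
    return s
```

With g the gradient of a concave objective on the simplex, the iterates settle into the set of maximizers, mirroring the conclusion of Lemma 6.3.4.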

At last, this lemma allows us to give a

Proof of Lemma 6.2.1. The crucial point is to show that under the assumptions V1 and V2 and for P_{X̄_0}-a.a. x̄, parts 3 and 4 of Lemma 6.3.4 hold when we choose the β_n defined by (6.3.11).

To this end, we first note that as in Yakowitz, Györfi et al. (1999, Corollary 1) assumption V2, i.e., the existence of a constant L with

\[
|m(s,\bar x) - m(s,\bar y)| \le \frac{L}{\sqrt{md}}\,|\bar x - \bar y| \quad \text{for all } \bar x, \bar y \in \mathbb{R}^{dm}_+,\ s \in S,
\]

implies

\[
m(s,\bar x) = m_L(s,\bar x).
\]

Concerning part 3, one has

\[
\begin{aligned}
\big|\hat m_{n,L}(Z_{n-1}^L,\bar x) - m(Z_{n-1}^L,\bar x)\big|
&= \big|\hat m_{n,L}(Z_{n-1}^L,\bar x) - m_L(Z_{n-1}^L,\bar x)\big| \\
&\le \big|\hat M_{1,n}(Z_{n-1}^L,\bar x) - M_1(Z_{n-1}^L,\bar x)\big| + \left|\sum_{k=2}^{N_n}\hat\Delta_{k,n,L}(Z_{n-1}^L,\bar x) - \sum_{k=2}^{\infty}\Delta_{k,L}(Z_{n-1}^L,\bar x)\right| \\
&\le \big|\hat M_{1,n}(Z_{n-1}^L,\bar x)\big| + \big|M_1(Z_{n-1}^L,\bar x)\big| + \sum_{k=2}^{\infty}\big|\hat\Delta_{k,n,L}(Z_{n-1}^L,\bar x)\big| + \sum_{k=2}^{\infty}\big|\Delta_{k,L}(Z_{n-1}^L,\bar x)\big| \\
&\le \big|\hat M_{1,n}(Z_{n-1}^L,\bar x)\big| + \big|M_1(Z_{n-1}^L,\bar x)\big| + 2L\sum_{k=2}^{\infty}2^{-k}.
\end{aligned}
\]

Combining this with the inequalities

\[
\big|\hat M_{1,n}(Z_{n-1}^L,\bar x)\big| = \left|\sum_{j=-M}^{n}\frac{X_j}{\langle Z_{n-1}^L, X_j\rangle}\,1_{A_1(\bar x)}(\bar X_j)\right|\Bigg/\sum_{j=-M}^{n} 1_{A_1(\bar x)}(\bar X_j) \le \max_{-M\le j\le n}\left|\frac{X_j}{\langle Z_{n-1}^L, X_j\rangle}\right|,
\]
\[
\big|M_1(Z_{n-1}^L,\bar x)\big| = \left|E\!\left[\frac{X_0}{\langle Z_{n-1}^L, X_0\rangle}\,\middle|\,\bar X_0 \in A_1(\bar x)\right]\right| \le \sup_{\omega\in\Omega}\left|\frac{X_0}{\langle Z_{n-1}^L, X_0\rangle}\right|
\]

we get

\[
\big|\hat m_{n,L}(Z_{n-1}^L,\bar x) - m(Z_{n-1}^L,\bar x)\big|
\le \max_{-M\le j\le n}\left|\frac{X_j}{\langle Z_{n-1}^L, X_j\rangle}\right| + \sup_{\omega\in\Omega}\left|\frac{X_0}{\langle Z_{n-1}^L, X_0\rangle}\right| + 2L\sum_{k=2}^{\infty}2^{-k}
\le 2\sqrt{m}\,\frac{b}{a} + L =: c,
\]

the latter inequality due to (6.3.1).

Condition 4 is readily verified from

\[
\big|\hat m_{n,L}(Z_{n-1}^L,\bar x) - m(Z_{n-1}^L,\bar x)\big| \le \sup_{s\in S}\big|\hat m_{n,L}(s,\bar x) - m_L(s,\bar x)\big|
\]

and Lemma 6.3.3. This finishes the proof of Lemma 6.2.1. □



6.3.2 Proof of the related Theorems 6.2.2 - 6.2.4


Proof of Theorem 6.2.2. For any fixed x̄, R(·, x̄) is a continuous function on the compact set S, thus uniformly continuous. Part 1 of the theorem directly follows from Lemma 6.2.1.

To prove part 2, observe that (6.3.3) involves only denumerably many functions 1_{A_k(x̄)} (for fixed k and all possible values of x̄). Thus, the exceptional set of ω ∈ Ω in Lemma 6.3.1 can be made independent of the chosen x̄. This continues throughout the proof of Lemma 6.2.1. Hence as n → ∞ we have for P ⊗ P_{X̄_0}-a.a. (ω, x̄)

\[
\rho\big(Z_n^L(\bar x), K_T(\bar x)\big) \longrightarrow 0,
\]
\[
\big|Z_n^L(\bar x) - Z_n^{L*}(\bar x)\big| \longrightarrow 0, \tag{6.3.12}
\]
\[
R(Z_n^L(\bar x), \bar x) \longrightarrow R^*(\bar x).
\]

From part 1, the Lebesgue dominated convergence theorem yields the assertions of the second part of the theorem. The limit relation (6.3.12), now valid P-a.s. for P_{X̄_0}-a.a. x̄, and

\[
\big|Z_n^L(\bar x) - Z_n^{L*}(\bar x)\big| \le \max_{s,t\in S}|s-t| \le m
\]

imply

\[
\int \big|Z_n^L(\bar x) - Z_n^{L*}(\bar x)\big|\,P_{\bar X_0}(d\bar x) \longrightarrow 0 \quad (n\to\infty)
\]

P-a.s. and in L^r(P) (r ∈ IN). The same arguments, starting from

\[
\big|R(Z_n^L(\bar x),\bar x) - R^*(\bar x)\big| \le \big|R(Z_n^L(\bar x),\bar x)\big| + \big|R^*(\bar x)\big| \le 2\max_{s\in S,\,y\in[a,b]^m}\big|\log\langle s, y\rangle\big| < \infty,
\]

yield the remaining parts of the proof. □


Proof of Corollary 6.2.3. The assumption on the support of P_{X_0} implies that an essentially, i.e., P_{X̄_0}-a.s., unique log-optimal portfolio selection function

\[
w(\bar x) := \arg\max_{s\in S} E\big[\log\langle s, X_0\rangle \,\big|\, \bar X_0 = \bar x\big]
\]

exists (Algoet and Cover, 1988, p. 877, corrected in Österreicher and Vajda, 1993, and Vajda and Österreicher, 1994). The accumulated return using the log-optimal portfolio selection function for investment is denoted by

\[
R_n^* := \prod_{i=0}^{n-1}\langle w(\bar X_{i+1}), X_{i+1}\rangle.
\]

It follows that

\[
\limsup_{n\to\infty}\frac1n\log\frac{V_n}{R_n} \le \limsup_{n\to\infty}\frac1n\log\frac{V_n}{R_n^*} + \limsup_{n\to\infty}\frac1n\log\frac{R_n^*}{R_n}. \tag{6.3.13}
\]
For the first term on the right hand side, the ergodic theorem and the optimality of w imply

\[
\begin{aligned}
\limsup_{n\to\infty}\frac1n\log\frac{V_n}{R_n^*}
&= \limsup_{n\to\infty}\frac1n\sum_{i=0}^{n-1}\log\frac{\langle f(\bar X_{i+1}), X_{i+1}\rangle}{\langle w(\bar X_{i+1}), X_{i+1}\rangle}
= E\left[\log\frac{\langle f(\bar X_0), X_0\rangle}{\langle w(\bar X_0), X_0\rangle}\right] \\
&= \int\Big(E\big[\log\langle f(\bar x), X_0\rangle \,\big|\, \bar X_0 = \bar x\big] - E\big[\log\langle w(\bar x), X_0\rangle \,\big|\, \bar X_0 = \bar x\big]\Big)\,P_{\bar X_0}(d\bar x) \\
&\le 0.
\end{aligned} \tag{6.3.14}
\]

Arguing along the lines of Walk (2000, Corollary 1), the second term has limiting behaviour

\[
\limsup_{n\to\infty}\frac1n\log\frac{R_n^*}{R_n}
= \limsup_{n\to\infty}\frac1n\sum_{i=0}^{n-1}\Big(\log\langle w(\bar X_{i+1}), X_{i+1}\rangle - \log\langle Z_i^L(\bar X_{i+1}), X_{i+1}\rangle\Big) = 0. \tag{6.3.15}
\]

This is seen as follows: (6.3.12) combined with Egorov's theorem shows that for each ε > 0 we can find sets Ω̃ ⊆ Ω and Ĩ ⊆ [a, b]^{dm} such that P(Ω̃) ≥ 1 − ε, P_{X̄_0}(Ĩ) ≥ 1 − ε and

\[
Z_n^L \to w \quad\text{uniformly on } \tilde\Omega\times\tilde I. \tag{6.3.16}
\]

Then

\[
\begin{aligned}
&\frac1n\left|\sum_{i=0}^{n-1}\Big(\log\langle w(\bar X_{i+1}), X_{i+1}\rangle - \log\langle Z_i^L(\bar X_{i+1}), X_{i+1}\rangle\Big)\right| \\
&\le \frac1n\sum_{i=0}^{n-1}\Big|\log\langle w(\bar X_{i+1}), X_{i+1}\rangle - \log\langle Z_i^L(\bar X_{i+1}), X_{i+1}\rangle\Big|\cdot 1_{\tilde I}(\bar X_{i+1}) \\
&\quad + \frac1n\sum_{i=0}^{n-1}\Big|\log\langle w(\bar X_{i+1}), X_{i+1}\rangle - \log\langle Z_i^L(\bar X_{i+1}), X_{i+1}\rangle\Big|\cdot 1_{\tilde I^c}(\bar X_{i+1}) \\
&\le \frac{c}{n}\sum_{i=0}^{n-1}\big|w(\bar X_{i+1}) - Z_i^L(\bar X_{i+1})\big|\cdot 1_{\tilde I}(\bar X_{i+1}) + \frac{c}{n}\sum_{i=0}^{n-1} 1_{\tilde I^c}(\bar X_{i+1})
\end{aligned}
\]

for some constant c > 0. The first term tends to zero by (6.3.16), the second term to c · P_{X̄_0}(Ĩ^C) ≤ c · ε. Now, let ε go to zero.

Finally, (6.3.14) and (6.3.15) plugged into (6.3.13) finish the proof. □
Proof of Theorem 6.2.4. It suffices to prove Lemma 6.2.1 for Z_n instead of Z_n^L. Without loss of generality one can set M = 0. In the following we assume n to be sufficiently large such that γ_n ≥ L. Because X̄_0 takes on values in a denumerable set X, for any ε > 0 there exists a finite subset X̄ ⊆ X with P(X̄_0 ∈ X̄^C) ≤ ε. As in the proof of Corollary 6.2.3, w(·) denotes the essentially unique log-optimal portfolio selection function.

For ω ∈ Ω, x̄ ∈ X, we consider the sequence

\[
Z_n^{L_n}(\bar x) = Z_n^{L_n}(\bar x, \omega) \in S
\]

and show that for P_{X̄_0}-a.a. x̄ a P-a.s. limit relation Z_n^{L_n}(x̄, ω) → w(x̄) holds. To establish this, we work through the following: Consider an accumulation point f_ω(x̄) of the sequence, say

\[
Z_{n'}^{L_{n'}}(\bar x, \omega) \longrightarrow f_\omega(\bar x)
\]

for a subsequence n′, and show that P-a.s.

\[
f_\omega(\bar x) = w(\bar x) \quad\text{for } P_{\bar X_0}\text{-a.a. } \bar x. \tag{6.3.17}
\]

Indeed, (6.3.17) implies the existence of a set A ∈ A, P(A) = 1, such that for any ω ∈ A

\[
Z_n^{L_n}(\bar x, \omega) \longrightarrow w(\bar x) \tag{6.3.18}
\]

for P_{X̄_0}-a.a. x̄. This is seen as follows: Assume we had an x̄ with P_{X̄_0}({x̄}) > 0 and P(A(x̄)) > 0, where A(x̄) := {ω | Z_n^{L_n}(x̄, ω) ↛ w(x̄)}. P(A(x̄)) > 0 implies A(x̄) ∩ A ≠ ∅, hence an ω ∈ A(x̄) ∩ A exists with Z_n^{L_n}(x̄, ω) ↛ w(x̄) on the one

We now tackle the proof of (6.3.17).

For any x̄ ∈ X̄ the sequence Z_{n'}^{L_{n'}}(x̄, ω) takes on values in the compact set S, i.e. there exists an accumulation point f_ω(x̄). Since X̄ is finite, repeatedly taking subsequences gives an index sequence n″ for which

\[
Z_{n''}^{L_{n''}}(\bar x, \omega) \longrightarrow f_\omega(\bar x) \tag{6.3.19}
\]

uniformly for all x̄ ∈ X̄. n″ will be denoted by n again in the following. For any fixed ω

\[
\begin{aligned}
&\left|\frac{1}{n+1}\sum_{i=0}^{n}\log\langle Z_n^{L_n}(\bar X_i,\omega), X_i\rangle - E\big[\log\langle f_\omega(\bar X_0), X_0\rangle\big]\right| \\
&\le \left|\frac{1}{n+1}\sum_{i=0}^{n}\log\langle Z_n^{L_n}(\bar X_i,\omega), X_i\rangle\,1_{\bar{\mathcal X}}(\bar X_i) - E\big[\log\langle f_\omega(\bar X_0), X_0\rangle\,1_{\bar{\mathcal X}}(\bar X_0)\big]\right| \\
&\quad + \left|\frac{1}{n+1}\sum_{i=0}^{n}\log\langle Z_n^{L_n}(\bar X_i,\omega), X_i\rangle\,1_{\bar{\mathcal X}^C}(\bar X_i)\right| + \left|E\big[\log\langle f_\omega(\bar X_0), X_0\rangle\,1_{\bar{\mathcal X}^C}(\bar X_0)\big]\right| \\
&\le \frac{1}{n+1}\sum_{i=0}^{n}\Big|\log\langle Z_n^{L_n}(\bar X_i,\omega), X_i\rangle - \log\langle f_\omega(\bar X_i), X_i\rangle\Big|\,1_{\bar{\mathcal X}}(\bar X_i) \\
&\quad + \left|\frac{1}{n+1}\sum_{i=0}^{n}\log\langle f_\omega(\bar X_i), X_i\rangle\,1_{\bar{\mathcal X}}(\bar X_i) - E\big[\log\langle f_\omega(\bar X_0), X_0\rangle\,1_{\bar{\mathcal X}}(\bar X_0)\big]\right| \\
&\quad + c\cdot\left(\frac{1}{n+1}\sum_{i=0}^{n}1_{\bar{\mathcal X}^C}(\bar X_i) + P(\bar X_0 \in \bar{\mathcal X}^C)\right)
\end{aligned} \tag{6.3.20}
\]

with a constant c = c(d, m, a, b) ∈ IR_+.
with a constant c = c(d, m, a, b) ∈ IR+ .
Because of the uniform convergence in (6.3.19) the first term of (6.3.20) satisfies (for n sufficiently large)

\[
\Big|\log\langle Z_n^{L_n}(\bar X_i,\omega), X_i\rangle - \log\langle f_\omega(\bar X_i), X_i\rangle\Big|\,1_{\bar{\mathcal X}}(\bar X_i)
\le c\cdot\big|Z_n^{L_n}(\bar X_i,\omega) - f_\omega(\bar X_i)\big|\,1_{\bar{\mathcal X}}(\bar X_i) \le c\cdot\varepsilon \tag{6.3.21}
\]

for all i = 0, ..., n. Without loss of generality we may use the same constant c as above.

For the second term it will be shown at the end of this proof that for P-a.a. ω

\[
\limsup_{n\to\infty}\left|\frac{1}{n+1}\sum_{i=0}^{n}\log\langle f_\omega(\bar X_i), X_i\rangle\,1_{\bar{\mathcal X}}(\bar X_i) - E\big[\log\langle f_\omega(\bar X_0), X_0\rangle\,1_{\bar{\mathcal X}}(\bar X_0)\big]\right| = 0. \tag{6.3.22}
\]

For the third term the ergodic theorem yields that P-a.s.

\[
\lim_{n\to\infty}\frac{1}{n+1}\sum_{i=0}^{n}1_{\bar{\mathcal X}^C}(\bar X_i) = P(\bar X_0 \in \bar{\mathcal X}^C) \le \varepsilon. \tag{6.3.23}
\]

(6.3.21) to (6.3.23) plugged into (6.3.20) yield

\[
\limsup_{n\to\infty}\left|\frac{1}{n+1}\sum_{i=0}^{n}\log\langle Z_n^{L_n}(\bar X_i,\omega), X_i\rangle - E\big[\log\langle f_\omega(\bar X_0), X_0\rangle\big]\right| \le 3\cdot c\cdot\varepsilon,
\]

and from ε being arbitrary it follows that

\[
\frac{1}{n+1}\sum_{i=0}^{n}\log\langle Z_n^{L_n}(\bar X_i,\omega), X_i\rangle \longrightarrow E\big[\log\langle f_\omega(\bar X_0), X_0\rangle\big] \tag{6.3.24}
\]

for P-a.a. ω.
On the other hand, using the definition of the random variable L_n, we obtain P-a.s.

\[
\frac{1}{n+1}\sum_{i=0}^{n}\Big(\log\langle Z_n^{L_n}(\bar X_i,\omega), X_i\rangle - \log\langle w(\bar X_i), X_i\rangle\Big)
\ge \frac{1}{n+1}\sum_{i=0}^{n}\Big(\log\langle Z_n^{L}(\bar X_i,\omega), X_i\rangle - \log\langle w(\bar X_i), X_i\rangle\Big) \longrightarrow 0, \tag{6.3.25}
\]

where a limit relation analogous to (6.3.15) can be used.


Plugging (6.3.24) into the first line of (6.3.25) and observing

\[
\frac{1}{n+1}\sum_{i=0}^{n}\log\langle w(\bar X_i), X_i\rangle \longrightarrow E\big[\log\langle w(\bar X_0), X_0\rangle\big]
\]

we obtain that (again for P-a.a. ω)

\[
E\big[\log\langle f_\omega(\bar X_0), X_0\rangle\big] - E\big[\log\langle w(\bar X_0), X_0\rangle\big] \ge 0.
\]

Because of the essential uniqueness of the optimum w(·), for P-a.a. ω we infer (6.3.17), namely that f_ω(x̄) = w(x̄) P_{X̄_0}-a.s.
So it only remains to demonstrate (6.3.22). To this end, let C := C([a, b]^{dm}, S) be the space of continuous functions f : [a, b]^{dm} → S, equipped with the supremum norm sup_{x̄∈[a,b]^{dm}} |f(x̄)|. For f_ω we can find a continuation f̄_ω contained in C which coincides with f_ω on X̄. Due to the separability of C (Megginson, 1998, Sec. 1.12) a denumerable set G ⊆ C can be found such that for any given ε > 0 and any f_ω there exists a function g_ω ∈ G satisfying sup_{x̄∈X̄} |f_ω(x̄) − g_ω(x̄)| ≤ ε. For given f : [a, b]^{dm} → S use the shorthand notation H_n(ω, f) for

\[
\frac{1}{n+1}\sum_{i=0}^{n}\log\langle f(\bar X_i), X_i\rangle\,1_{\bar{\mathcal X}}(\bar X_i) - E\big[\log\langle f(\bar X_0), X_0\rangle\,1_{\bar{\mathcal X}}(\bar X_0)\big].
\]

Then

\[
\begin{aligned}
|H_n(\omega, f_\omega)| &= |H_n(\omega, g_\omega) + H_n(\omega, f_\omega) - H_n(\omega, g_\omega)| \\
&\le |H_n(\omega, g_\omega)| + \frac{1}{n+1}\sum_{i=0}^{n}\Big|\log\langle f_\omega(\bar X_i), X_i\rangle - \log\langle g_\omega(\bar X_i), X_i\rangle\Big|\,1_{\bar{\mathcal X}}(\bar X_i) \\
&\quad + E\Big[\big|\log\langle f_\omega(\bar X_0), X_0\rangle - \log\langle g_\omega(\bar X_0), X_0\rangle\big|\,1_{\bar{\mathcal X}}(\bar X_0)\Big] \\
&\le |H_n(\omega, g_\omega)| + 2\cdot c\cdot\sup_{\bar x\in\bar{\mathcal X}}|f_\omega(\bar x) - g_\omega(\bar x)| \\
&\le |H_n(\omega, g_\omega)| + 2\cdot c\cdot\varepsilon.
\end{aligned}
\]

Because of ε being arbitrary, it suffices to convince ourselves of

\[
H(\omega, g_\omega) := \limsup_{n\to\infty}|H_n(\omega, g_\omega)| = 0
\]

for P-a.a. ω. As to this, observe that (because of G being denumerable) the set

\[
\{\omega \mid \exists g\in G:\ H(\omega, g) > 0\} = \bigcup_{g\in G}\{\omega \mid H(\omega, g) > 0\}
\]

is measurable. Using the ergodic theorem, the left hand side is a countable union of null sets, i.e. a null set itself. Hence for P-a.a. ω

\[
H(\omega, g) = 0 \quad\text{for all } g \in G,
\]

and in particular H(ω, g_ω) = 0, which completes the proof. □



6.4 Simulations and examples


We conclude this chapter by simulations and examples in which we apply the
estimated log-optimal portfolio selection functions of Section 6.2 to simulated
and real markets. Throughout we select portfolios on the basis of the last d = 5
observed return data.

Example 6.1: The market consists of a riskless bond with return of 2.6% per market period and a share that follows a geometrical Brownian motion (Luenberger, 1998, Sec. 11.7; Korn and Korn, 1999, Ch. 2) with a mean return µ = 3% per market period and a volatility σ = 15% per market period. Investment starts after 5 market periods and ends after 50 market periods. In this model, due to the independence of the share's log-returns, the log-optimal portfolio selection function coincides with the log-optimal portfolio, which suggests investing 67.86% of the current wealth in each market period into

6.1 a) Proportion of wealth invested in the share (left vertical axis),


expected portfolio log-return (right vertical axis, in %). Results
for the true log-optimal strategy (dashed) and the estimated log-
optimal strategy (solid).


6.1 b) Value of a $1 investment in the share (grey, solid) or in the


bond (grey, dashed), respectively. We compare the value of a $1
investment in the true log-optimal strategy (upper black curve) and
the value of a $1 investment in the estimated log-optimal strategy
(lower black curve).
Figure 6.1: Sample path of an investment in a share following a geometrical Brownian motion and a bond during 50 market periods.

the share (calculated by Cover's algorithm, Theorem 1.3.2). Figures 6.1 and 6.2 show sample paths in the market together with estimation results. Throughout this section we use the kernel variant of the projection algorithm (6.2.2) with a cosine kernel K(x̄) = cos(min{‖x̄‖_F/100, 1}) + 1 (the Frobenius norm ‖x̄‖_F of x̄ = (x_{i,j})_{1≤i≤2, 1≤j≤d} being defined as the square root of the sum of the diagonal elements of x̄ᵀx̄) and L = 100.
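For comparison, the constant log-optimal weight in Example 6.1 can be approximated with the multiplicative update of Cover's algorithm, b_j ← b_j · E[X_j/⟨b, X⟩], run on Monte Carlo draws; modelling the share's per-period log-return as N(0.03, 0.15²) is our reading of the example's parameters, so the resulting weight should only come out near the quoted 67.86%.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo draws of per-period gross returns (bond, share).  The bond pays
# 1.026 deterministically; the share's per-period log-return is modelled as
# N(0.03, 0.15**2) -- an assumption of ours about the example's parameters.
n_draws = 200_000
X = np.column_stack([
    np.full(n_draws, 1.026),
    np.exp(rng.normal(0.03, 0.15, n_draws)),
])

# Cover-style multiplicative update b_j <- b_j * E[X_j / <b, X>], taken under
# the empirical distribution of the draws; each step keeps b on the simplex
# and increases the empirical E log <b, X>.
b = np.full(2, 0.5)
for _ in range(500):
    b = b * (X / (X @ b)[:, None]).mean(axis=0)

share_weight = b[1]   # estimated log-optimal fraction of wealth in the share
```

The weight fluctuates with the Monte Carlo sample; the exact update is the one referred to in the text as Cover's algorithm (Theorem 1.3.2).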
Subgraphs a) of Figures 6.1 and 6.2 show the estimated log-optimal portfolio weight for the share (solid line), i.e. the coordinate of Z_n^L(X_{n+1−d}, ..., X_n) that corresponds to the share. The results can be compared with the true log-optimal portfolio (dashed line) and the expected portfolio returns per market period given on the right vertical axis (in %).

In subgraphs b) we follow the value of a $1 investment in the share (grey, solid) or in the bond (grey, dashed), respectively. We compare the value of a $1 investment in the true log-optimal strategy (upper black curve) and the value of a $1 investment in the strategy using the estimated log-optimal portfolio weights (lower black curve). The results are convincing: as we expect from the competitive optimality result (Corollary 6.2.3), the estimated strategy allows us to track the value evolution of the log-optimal strategy.

Example 6.2: In this example we run the projection algorithm for the estima-
tion of log-optimal portfolio selection functions on real market data from NYSE,
22/4/1998-6/7/1998 (daily closing price data from www.wallstreetcity.com).
We use the same stocks (YELL, JBHT, UNP) as in Example 2.2, Section 2.3.

Here, we do not know the true log-optimal portfolio selection function. As a


substitute reference strategy we use the strategy with the constant weights esti-
mated in Example 2.2, Section 2.3 (i.e., (0.523951, 0.476049) for (JBHT,YELL),


6.2 a) Proportion of wealth invested in the share (left vertical axis),


expected portfolio log-return (right vertical axis, in %). Results
for the true log-optimal strategy (dashed) and the estimated log-
optimal strategy (solid).


6.2 b) Value of a $1 investment in the share (grey, solid) or in the


bond (grey, dashed), respectively. We compare the value of a $1
investment in the true log-optimal strategy (upper black curve) and
the value of a $1 investment in the estimated log-optimal strategy
(lower black curve).
Figure 6.2: Sample path of an investment in a share following a geometrical Brownian motion and a bond during 50 market periods.

and (0.465490, 0.53451) for (UNP,YELL)). We then compare the value of a $1


investment in the strategy using the estimated log-optimal portfolio selection
function with the value of the reference strategy (Figure 6.3). Investment starts
on the 5th day of trading. The value of the estimated log-optimal portfolio
strategy virtually coincides with the value of what was believed to be the true
log-optimal strategy in Example 2.2, Section 2.3 (we therefore plotted the value
for the log-optimal portfolio selection function only). This suggests that in
either case we are close to the log-optimal portfolio selection strategy.
– One might be tempted to argue that compared with Section 2.3 not much has
been gained. This is not the case. On the contrary, in Section 2.3 we considered
a very restrictive model involving independent, log-normally distributed daily

6.3 a) JBHT (grey, solid) and YELL (grey, dashed).


6.3 b) UNP (grey, solid) and YELL (grey, dashed).

Figure 6.3: Value of a $1 investment in two single stocks (grey), and in the es-
timated log-optimal portfolio of the two (black, solid) at NYSE 22/4-6/7/1998.

returns. We just sketched and, in fact, then skipped most of the huge effort that
should have been put into diagnostic testing of these assumptions in Section 2.3.
Much of this effort is superfluous here. Indeed, with the model of Chapter 6
we gained considerable flexibility with respect to the underlying market model,
assuming not much more than stationarity and ergodicity. Considering there
is no such thing as absolute certainty about what the true stochastic regime
in the market looks like, we come to appreciate nonparametric algorithms that
work well under very weak assumptions and hence may be applied in many real
markets.

L’ENVOI

Clearly, we were only able to cover a small sample of the problems the investor faces in real markets. In the course of this thesis we derived several algorithms for these selected problems, and the reader might (and hopefully will) find some of them helpful in deciding on practical investment problems. Beyond these algorithms, we hope we have conveyed the key message of this thesis: the insight that nonparametric statistical forecasting and estimation techniques are a valuable tool in portfolio selection and, in fact, in all of mathematical finance.
181

REFERENCES

P. Algoet (1992): Universal schemes for prediction, gambling and portfolio


selection, Ann. Probab., 20(2), 901-941.
P. Algoet (1994): The strong law of large numbers for sequential decisions
under uncertainty, IEEE Trans. Inform. Theory, 40(3), 609-634.
P. Algoet (1999): Universal schemes for learning the best nonlinear predic-
tor given the infinite past and side information, IEEE Trans. Inform.
Theory, 45(4), 1165-1185.
P. Algoet and Th. Cover (1988): Asymptotic optimality and asymptotic
equipartition properties of log-optimal investment, Ann. Probab., 16(2),
876-898.
C.D. Aliprantis and K.C. Border (1999): Infinite Dimensional Analysis,
Springer, Berlin.
H.Z. An, Z.G. Chen and E.J. Hannan (1982): Autocorrelation, autore-
gression and autoregressive approximation (with corrections), Ann. Stat.,
10(3), 926-936.
C. Atkinson, S.R. Pliska and P. Wilmott (1997): Portfolio manage-
ment with transaction costs, Proc. Roy. Soc. Lond., Ser. A 453, No.
1958, 551-562.
L. Bachelier (1900): Théorie de la spéculation, Ann. Sci. Ecole Norm.
Sup., 17, 21-86.
D.H. Bailey (1976): Sequential Schemes for Classifying and Predicting Er-
godic Processes, Ph.D. thesis, Stanford University.
A. Barron and Th. Cover (1988): A bound on the financial value of in-
formation, IEEE Trans. Inform. Theory, 34, 1097-1100.
J. Bather (2000): Decision Theory, Wiley, Chichester.
182 References

R. Bell and Th. Cover (1988): Game-theoretic optimal portfolios, Man-


agement Science, 34(6), 724-733.

D.P. Bertsekas (1976): Dynamic Programming and Stochastic Control,


Academic Press, New York.

D.P. Bertsekas and S.E. Shreve (1978): Stochastic Optimal Control:


The Discrete Time Case, Academic Press, New York.

T. Bielecki, D. Hernández-Hernández and S.R. Pliska (1999): Risk


sensitive control of finite state Markov chains in discrete time, with appli-
cations to portfolio management, Math. Meth. Oper. Res., 50, 167-188.

T. Bielecki and S.R. Pliska (1999): Risk–sensitive dynamic asset man-


agement, Appl. Math. Optim., 39, 337-360.

T. Bielecki and S.R. Pliska (2000): Risk sensitive asset management with
transaction costs, Finance Stochast., 4, 1-33.

A. Blum and A. Kalai (1999): Universal portfolios with and without


transaction costs, Mach. Learning, 35, 193-205.

R.V. Bobryk and L. Stettner (1999): Discrete time portfolio selection


with proportional transaction costs, Probab. Math. Statistics, 19(2),
235-248.

D. Bosq (1996): Nonparametric Statistics for Stochastic Processes, Springer,


New York.

A. Böttcher and S.M. Grudsky (1998): On the condition numbers of


large semidefinite Toeplitz matrices, Lin. Alg. Appl., 279, 285-301.

L. Breiman (1961): Optimal gambling systems for favourable games, in:


Fourth Berkeley Symposium on Mathematical Statistics and Probability,
Prague 1961, 65-78, University of California Press, Berkeley.

P.J. Brockwell and R.A. Davis (1991): Time Series: Theory and Meth-
ods. Springer, New York.

V.V. Buldygin and V.S. Doncenko (1977): Convergence to zero of Gaus-


sian sequences, Mat. Zametki, 21(4), 531-538, English translation: Math.
Notes Acad. Sci. USSR, 21(3-4), 296-300.
183

A. Cadenillas (2000): Consumption-investment problems with transaction


costs: Survey and open problems, Math. Meth. Oper. Res., 51, 43-68.

P. Caines (1988): Linear Stochastic Systems, Wiley, New York.

L. Carassus and E. Jouini (2000): A discrete time stochastic model for


investment with an application to the transaction cost case, J. Math.
Economics., 33, 57-80.

F.H. Clarke (1981): Generalized gradients of lipschitz functions, Adv. in


Math., 40, 52-67.

Th. Cover (1980): Competitive optimality of logarithmic investment, Math.


Oper. Res., 5(2), 161-166.

Th. Cover (1984): An algorithm for maximizing expected log investment


return, IEEE Trans. Inform. Theory, 30, 369-373.

Th. Cover (1991): Universal portfolios, Math. Finance, 1(1), 1-29.

Th. Cover and E. Ordentlich (1996): Universal portfolios with side in-
formation, IEEE Trans. Inform. Theory, 42(2), 348-363.

Th. Cover and J.A. Thomas (1991): Elements of Information Theory, Wi-
ley, New York.

J.C. Cox, S.A. Ross and M. Rubinstein (1979): Option pricing: a sim-
plified approach, Journ. Financial Economics, 7, 229-263.

M.H. Davis and A.R. Norman (1990): Portfolio selection with transac-
tion costs, Math. Oper. Res., 15(4), 676-713.

L.D. Davisson (1965): The prediction error of stationary Gaussian time se-
ries of unknown covariance, IEEE Trans. Inform. Theory, 11(4), 527-532.

L. Devroye, L. Györfi and G. Lugosi (1996): A Probabilistic Theory of


Pattern Recognition, Springer, New York.

Z. Ding, C.W. Granger and R.F. Engle (1993): A long memory prop-
erty of stock market and a new model, J. Empir. Finance, 1, 83-106.

I. Donowitz and M. El-Gamal (1997): Financial market structure and


the ergodicity of prices, Social Systems Research Institute (SSRI) working
paper, University of Wisconsin-Madison.
184 References

J.L. Doob (1953): Stochastic Processes, Wiley, New York.

J.L. Doob (1984): Classical Potential Theory and its Probabilistic Counter-
part, Springer, New York.

P. Doukhan (1994): Mixing, Springer, New York.

D. Duffie (1988): Security Markets - Stochastic Models, Academic Press,


San Diego.

W. Feller (1968): An Introduction to Probability Theory and Its Applica-


tions, vol. 1, Wiley, New York.

B.G. Fitzpatrick and W.H. Fleming (1991): Numerical methods for an


optimal investment-consumption model, Math. Meth. Oper. Res., 16(4),
823-841.

W.H. Fleming (1999): Controlled Markov processes and mathematical fi-


nance, in: Nonlinear analysis, differential equations and control, NATO
Sci. Ser. C Math. Phys. Sci., 528, p. 407-446, Kluwer, Dordrecht.

L.R. Foulds (1981): Optimization Techniques, Springer, New York.

J. Francis (1980): Investments - Analysis and Management, McGraw-Hill,


New York.

J. Franke, W. Härdle and C. Hafner (2001): Einführung in die Statis-


tik der Finanzmärkte, Springer, Berlin.

L. Gerencsér (1992): AR(∞) estimation and nonparametric stochastic com-


plexity, IEEE Trans. Inform. Theory, 38(6), 1768-1778.

L. Györfi, W. Härdle, P. Sarda and P. Vieu (1989): Nonparametric


Curve Estimation from Time Series, Springer, New York.

L. Györfi, M. Kohler, A. Krzyżak and H. Walk (2002): A Nonpara-


metric Theory of Regression, Springer, New York.

L. Györfi and G. Lugosi (2001): Strategies for sequential prediction of sta-


tionary time series, in “Modeling Uncertainty. An Examination of its
Theory, Methods and Applications,” Eds. M. Dror, P. L’Ecuyer and F.
Szidarovszky, Kluwer, Dordrecht, 225-248.
185

L. Györfi, G. Lugosi and G. Morvai (1999): A simple randomized algo-


rithm for sequential prediction of ergodic time series, IEEE Trans. Inform.
Theory, 45(7), 2642-2650.
L. Györfi, G. Morvai and S.J. Yakowitz (1998): Limits to consistent
on-line forecasting for ergodic time series, IEEE Trans. Inform. Theory,
44(2), 886-892.
L. Györfi, I. Páli and E. van der Meulen (1994): There is no univer-
sal source code for an infinite source alphabet, IEEE Trans. Inform.
Theory, 40(2), 267-271.

E.J. Hannan and M. Deistler (1988): The Statistical Theory of Linear


Systems, Wiley, New York.

W. Härdle (1990): Applied Nonparametric Regression, Cambridge UP, Cam-


bridge.

W. Härdle, G. Kerkyacharian, D. Picard and A. Tsybakov (1998):


Wavelets, Approximation, and Statistical Applications, Springer, New
York.

D. Helmbold, R.E. Schapire, Y. Singer and M.K. Warmuth(1998):


On-line portfolio selection using multiplicative updates, Math. Finance,
8(4), 325-347.

O. Hernández-Lerma and J.B. Lasserre (1996): Discrete-Time Markov


Control Processes, Springer, New York.

O. Hernández-Lerma and J.B. Lasserre (1999): Further Topics on Dis-


crete-Time Markov Control Processes, Springer, New York.
T. Hida and M. Hitsuda (1993): Gaussian Processes. Translations of Math.
Monographs, 10, AMS, Providence.

F. Hirzebruch and W. Scharlau (1996): Einführung in die Funktional-


analysis, Spektrum Akademischer Verlag, Heidelberg.

I.A. Ibragimov and Y.V. Linnik (1971): Independent and Stationary Se-
quences of Random Variables, Wolters-Noordhoff, Groningen.

I.A. Ibragimov and Y.A. Rozanov (1978): Gaussian Random Processes, Springer, New York.

E. Isaacson and H.B. Keller (1994): Analysis of Numerical Methods, Dover Pub., New York.

J. Kelly (1956): A new interpretation of information rate, Bell Sys. Tech. Journal, 35, 917-926.

K. Knopp (1956): Infinite Sequences and Series, Dover Publ., New York.

R. Korn and E. Korn (1999): Optionsbewertung und Portfoliooptimierung, Vieweg, Braunschweig.

H.J. Kushner and D.S. Clark (1978): Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer, New York.

N. Laib (1999): Uniform consistency of the partitioning estimate under ergodic conditions, J. Austral. Math. Soc., Series A, 67, 1-14.

N. Laib and E. Ould-Said (1996): Estimation non paramétrique robuste de la fonction de régression pour des observations ergodiques, C. R. Acad. Sci. Paris, Série I, 322, 271-276.

H.A. Latané (1959): Criteria for choice among risky ventures, Journal of
Political Economy, 38, 145-155.

Yu.S. Ledyaev (1984): Theorems on an implicitly defined multivalued mapping, Sov. Math. Dokl., 29(3), 545-548.

E. Lehmann (1983): Theory of Point Estimation, Wiley, New York.

D.G. Luenberger (1998): Investment Science, Oxford UP, Oxford.

H.M. Markowitz (1959): Portfolio selection, Wiley, New York.

H.M. Markowitz (1976): Investment for the long run: new evidence for an
old rule, J. Finance, 31(5), 1273-1286.

G. Matheron (1975): Random Sets and Integral Geometry, Wiley, New York.

J.H. McCulloch (1996): Financial applications of stable distributions, in “Statistical Methods in Finance” (Handbook of Statistics, 14), Elsevier Science, Amsterdam.

R.E. Megginson (1998): An Introduction to Banach Space Theory, Springer, New York.

S. Mittnik and S.T. Rachev (1993): Modeling asset returns with alterna-
tive stable distributions, Econometric Review, 12, 261-330.

I.S. Molchanov (1993): Limit Theorems for Unions of Random Closed Sets,
Lecture Notes in Mathematics 1561, Springer, Berlin.

H.W. Möller (1998): Die Börsenformel, Campus, Frankfurt.

T.F. Móri (1982): Asymptotic properties of the empirical strategy in favourable stochastic games, in “Limit theorems in probability and statistics”, Colloq. Math. Soc. János Bolyai (Veszprém, 1982), 36, 777-790.

G. Morvai (1991): Empirical log-optimal portfolio selection, Problems of Control and Information Theory, 20(6), 453-463.

G. Morvai (1992): Portfolio choice based on the empirical distribution, Kybernetika, 28(6), 484-493.

G. Morvai, S. Yakowitz and P. Algoet (1997): Weakly convergent nonparametric forecasting of stationary time series, IEEE Trans. Inform. Theory, 43(2), 483-498.

G. Morvai, S. Yakowitz and L. Györfi (1996): Nonparametric inference for ergodic, stationary time series, Ann. Statist., 24(1), 370-379.

J. Neveu (1968): Processus Aléatoires Gaussiens, Université de Montréal.

J.A. Ohlson (1972): Optimal portfolio selection in a log-normal market when the investor’s utility-function is logarithmic, Stanford Graduate School of Business, Research paper no. 117, Stanford University.

F. Österreicher and I. Vajda (1993): Existence, uniqueness and evaluation of log-optimal investment portfolio, Kybernetika, 29(2), 105-120.

A. Pagan and A. Ullah (1999): Nonparametric Econometrics, Cambridge UP, Cambridge.

E.E. Peters (1997): Fractal Market Analysis, Wiley, New York.

V. Petrov (1995): Limit Theorems of Probability Theory, Clarendon, Oxford.

J. Rissanen (1989): Stochastic Complexity in Statistical Inquiry, World Scientific, Teaneck, NJ.

S.M. Ross (1970): Applied Probability Models with Optimization Applications, Holden-Day, San Francisco.

B.Y. Ryabko (1988): Prediction of random sequences and universal coding, Problems of Inform. Trans., 24 (Apr./June), 87-96.

P.A. Samuelson (1967): General proof that diversification pays, J. Fin. and
Quant. Anal., 2, 1-13.

P.A. Samuelson (1969): Lifetime portfolio selection by dynamic stochastic programming, Rev. Economics and Statistics, 51, 239-246.

P.A. Samuelson (1971): The “fallacy” of maximizing the geometric mean in long sequences of investment or gambling, Proc. Nat. Acad. Sci. U.S.A., 68, 2493-2496.

D.W. Scott (1992): Multivariate Density Estimation: Theory, Practice and Visualisation, Wiley, New York.

S. Serra (1998): On the extreme eigenvalues of Hermitian (block) Toeplitz matrices, Lin. Alg. Appl., 270, 109-129.

S. Serra Capizzano (1999): Extreme singular values and eigenvalues of non-Hermitian block Toeplitz matrices, Journ. Comp. Appl. Math., 108, 113-130.

S. Serra Capizzano (2000): How bad can positive definite Toeplitz matri-
ces be? Numer. Funct. Anal. and Optimiz., 21(1-2), 255-261.

A.N. Shiryayev (1984): Probability, Springer, New York.

L. Stettner (1999): Risk sensitive portfolio optimization, Math. Meth. Oper. Res., 50, 463-474.

W.F. Stout (1974): Almost Sure Convergence, Academic Press, New York.

G. Toussaint (1971): Note on optimal selection of independent binary-valued features for pattern recognition, IEEE Trans. Inform. Theory, 17, 618.

I. Vajda and F. Österreicher (1994): Statistical analysis and applications of log-optimal investments, Kybernetika, 30(3), 331-342.

H. Walk (2000): Cover’s algorithm modified for nonparametric estimation of a log-optimal portfolio selection function, Preprint 2000-2, Math. Inst. A, Univ. Stuttgart.
H. Walk and S. Yakowitz (2002): Iterative nonparametric estimation of a log-optimal portfolio selection function, IEEE Trans. Inform. Theory, 48(1), 324-333.
M. Wax (1988): Order selection for AR models by predictive least squares,
IEEE Trans. Acoust. Speech Sign. Process., 36(4), 581-588.
D. Williams (1991): Probability with Martingales, Cambridge UP, Cam-
bridge.
S. Yakowitz, L. Györfi, J. Kieffer and G. Morvai (1999): Strongly
consistent nonparametric forecasting and regression for stationary ergodic
sequences, Journal of Multivariate Analysis, 71, 24-41.
Z. Ye and J. Li (1999): Optimal portfolio with risk control, Chinese J. Ap-
plied Probability and Statistics, 15(2), 152-167.
