Exact and Approximate Modeling
of Linear Systems:
A Behavioral Approach
Ivan Markovsky Jan C. Willems
Sabine Van Huffel Bart De Moor
Leuven, December 29, 2005
Preface
The behavioral approach, put forward in the three part paper by J. C. Willems [Wil87],
includes a rigorous framework for deriving mathematical models, a field called system
identification. By the mid 80s there was a well developed stochastic theory for linear
time-invariant system identification, the prediction error approach of L. Ljung, which has
numerous success stories. Nevertheless, the rationale for using the stochastic framework,
the question of what is meant by an optimal (approximate) model, and even more basically
what is meant by a mathematical model remained to some extent unclear.
A synergy of the classical stochastic framework (linear system driven by white noise)
and a key result of [Wil87] that shows how a state sequence of the system can be obtained
directly from observed data led to the very successful subspace identification methods [VD96].
Now the subspace methods together with the prediction error methods are the classical
approaches for system identification.
Another follow-up of [Wil87] is the global total least squares approach due to Roorda
and Heij. In a remarkable paper [RH95], Roorda and Heij address an approximate identification
problem truly in the behavioral framework, i.e., in a representation free setting.
Their results lead to practical algorithms that are similar in structure to the prediction error
methods: double minimization problems, of which the inner minimization is a smoothing
problem and the outer minimization is a nonlinear least squares problem. Unfortunately,
the global total least squares method has gained little attention in the system identification
community and the algorithms of [RH95, Roo95] did not find their way to robust numerical
implementation and consequently to practical applications.
The aim of this book is to present and popularize the behavioral approach to mathematical
modeling among theoreticians and practitioners. The framework we adopt applies
to static as well as dynamic and to linear as well as nonlinear problems. In the linear static
case, the approximate modeling problem considered specializes to the total least squares
method, which is classically viewed as a generalization of the least squares method to fitting
problems Ax ≈ b, in which there are errors in both the vector b and the matrix A. In the
quadratic static case, the behavioral approach leads to the orthogonal regression method for
fitting data to ellipses. In the first part of the book we examine static approximation problems:
weighted and structured total least squares problems and estimation of bilinear and
quadratic models, and in the second part of the book we examine dynamic approximation
problems: exact and approximate system identification. The exact identification problem
falls in the field of subspace identification and the approximate identification problem is the
global total least squares problem of Roorda and Heij.
Most of the problems in the book are presented in a deterministic setting, although
one can give a stochastic interpretation to the methods derived. The appropriate stochastic
model for this aim is the errors-in-variables model, where all observed variables are assumed
inexact due to measurement errors added on true data generated by a true model. The
assumption of the existence of a true model and the additional stochastic ones about the
measurement errors, however, are rarely verifiable in practice.
Except for the chapters on estimation of bilinear and quadratic models, we consider
total least squares-type problems. The unifying framework for approximate modeling put
forward in the book is called the misfit approach. In philosophy it differs essentially from the
classical approach, called the latency approach, where the model is augmented with unobserved
latent variables. A topic of current research is to clarify how the misfit and latency approaches
compare and complement each other.
We do not treat in the book advanced topics like statistical and numerical robustness
of the methods and algorithms. On the one hand, these topics are currently less developed
in the misfit setting than in the latency setting and, on the other hand, they go beyond
the scope of a short monograph. Our hope is that robustness as well as recursivity, further
applications, and connections with other methods will be explored and presented elsewhere
in the literature.
The prerequisites for reading the book are modest. We assume an undergraduate
level linear algebra and systems theory knowledge. Familiarity with system identification
is helpful but is not necessary. Sections with more specialized or technical material are
marked with ∗. They can be skipped without loss of continuity on a first reading.
This book is accompanied by a software implementation of the described algorithms.
The software is callable from MATLAB and most of it is written in MATLAB code. This
allows readers who have access to and knowledge of MATLAB to try out the examples,
modify the simulation setting, and apply the methods on their own data.
The book is based on the first author's Ph.D. thesis at the Department of Electrical
Engineering of the Katholieke Universiteit Leuven, Belgium. This work would be impossible
without the help of sponsoring organizations and individuals. We acknowledge the
financial support received from the Research Council of K.U. Leuven and the Belgian Programme
on Interuniversity Attraction Poles, projects IUAP IV-02 (1996–2001) and IUAP
V-22 (2002–2006). The work presented in the first part of the book is done in collaboration
with Alexander Kukush from the National Taras Shevchenko University, Kiev, Ukraine, and
the work presented in the second part is done in collaboration with Paolo Rapisarda from
the University of Maastricht, The Netherlands. We would like to thank Diana Sima and Rik
Pintelon for useful discussions and proofreading the drafts of the manuscript.
Ivan Markovsky
Jan C. Willems
Sabine Van Huffel
Bart De Moor
Leuven, Belgium
December 29, 2005
Contents
Preface i
1 Introduction 1
1.1 Latency and misfit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data fitting examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Classical vs. behavioral and stochastic vs. deterministic modeling . . . . . 9
1.4 Chapter-by-chapter overview∗ . . . . . . . . . . . . . . . . . . . . . . . 10
2 Approximate Modeling via Misfit Minimization 15
2.1 Data, model, model class, and exact modeling . . . . . . . . . . . . . . . 15
2.2 Misfit and approximate modeling . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Model representation and parameterization . . . . . . . . . . . . . . . . . 18
2.4 Linear static models and total least squares . . . . . . . . . . . . . . . . . 19
2.5 Nonlinear static models and ellipsoid fitting . . . . . . . . . . . . . . . . 21
2.6 Dynamic models and global total least squares . . . . . . . . . . . . . . . 23
2.7 Structured total least squares . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
I Static Problems 27
3 Weighted Total Least Squares 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Kernel, image, and input/output representations . . . . . . . . . . . . . . 33
3.3 Special cases with closed form solutions . . . . . . . . . . . . . . . . . . 35
3.4 Misfit computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Misfit minimization∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Simulation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Structured Total Least Squares 49
4.1 Overview of the literature . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 The structured total least squares problem . . . . . . . . . . . . . . . . . 51
4.3 Properties of the weight matrix∗ . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Stochastic interpretation∗ . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Efficient cost function and first derivative evaluation∗ . . . . . . . . . . . 60
4.6 Simulation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Bilinear Errors-in-Variables Model 71
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Adjusted least squares estimation of a bilinear model . . . . . . . . . . . 72
5.3 Properties of the adjusted least squares estimator . . . . . . . . . . . . . 75
5.4 Simulation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Fundamental matrix estimation . . . . . . . . . . . . . . . . . . . . . . . 78
5.6 Adjusted least squares estimation of the fundamental matrix . . . . . . . 80
5.7 Properties of the fundamental matrix estimator∗ . . . . . . . . . . . . . . 81
5.8 Simulation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6 Ellipsoid Fitting 85
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Quadratic errors-in-variables model . . . . . . . . . . . . . . . . . . . . 87
6.3 Ordinary least squares estimation . . . . . . . . . . . . . . . . . . . . . . 88
6.4 Adjusted least squares estimation . . . . . . . . . . . . . . . . . . . . . . 90
6.5 Ellipsoid estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.6 Algorithm for adjusted least squares estimation∗ . . . . . . . . . . . . . . 94
6.7 Simulation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
II Dynamic Problems 99
7 Introduction to Dynamical Models 101
7.1 Linear time-invariant systems . . . . . . . . . . . . . . . . . . . . . . . . 101
7.2 Kernel representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.3 Inputs, outputs, and input/output representation . . . . . . . . . . . . . . 105
7.4 Latent variables, state variables, and state space representations . . . . . . 106
7.5 Autonomous and controllable systems . . . . . . . . . . . . . . . . . . . 108
7.6 Representations for controllable systems . . . . . . . . . . . . . . . . . . 108
7.7 Representation theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.8 Parameterization of a trajectory . . . . . . . . . . . . . . . . . . . . . . . 111
7.9 Complexity of a linear time-invariant system . . . . . . . . . . . . . . . . 113
7.10 The module of annihilators of the behavior∗ . . . . . . . . . . . . . . . . 113
8 Exact Identification 115
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.2 The most powerful unfalsified model . . . . . . . . . . . . . . . . . . . . 117
8.3 Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.4 Conditions for identifiability . . . . . . . . . . . . . . . . . . . . . . . . 120
8.5 Algorithms for exact identification . . . . . . . . . . . . . . . . . . . . . 122
8.6 Computation of the impulse response from data . . . . . . . . . . . . . . 126
8.7 Realization theory and algorithms . . . . . . . . . . . . . . . . . . . . . 130
8.8 Computation of free responses . . . . . . . . . . . . . . . . . . . . . . . 132
8.9 Relation to subspace identification methods∗ . . . . . . . . . . . . . . . . 133
8.10 Simulation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
9 Balanced Model Identification 141
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.2 Algorithm for balanced identication . . . . . . . . . . . . . . . . . . . . 144
9.3 Alternative algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
9.4 Splitting of the data into "past" and "future"∗ . . . . . . . . . . . . . . . 146
9.5 Simulation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
10 Errors-in-Variables Smoothing and Filtering 151
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
10.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.3 Solution of the smoothing problem . . . . . . . . . . . . . . . . . . . . . 153
10.4 Solution of the filtering problem . . . . . . . . . . . . . . . . . . . . . . 155
10.5 Simulation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
10.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
11 Approximate System Identification 159
11.1 Approximate modeling problems . . . . . . . . . . . . . . . . . . . . . . 159
11.2 Approximate identification by structured total least squares . . . . . . . . 162
11.3 Modifications of the basic problem . . . . . . . . . . . . . . . . . . . . . 165
11.4 Special problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
11.5 Performance on real-life data sets . . . . . . . . . . . . . . . . . . . . . . 171
11.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
12 Conclusions 177
A Proofs 179
A.1 Weighted total least squares cost function gradient . . . . . . . . . . . . . 179
A.2 Structured total least squares cost function gradient . . . . . . . . . . . . 180
A.3 Fundamental lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
A.4 Recursive errors-in-variables smoothing . . . . . . . . . . . . . . . . . . 182
B Software 185
B.1 Weighted total least squares . . . . . . . . . . . . . . . . . . . . . . . . . 185
B.2 Structured total least squares . . . . . . . . . . . . . . . . . . . . . . . . 188
B.3 Balanced model identification . . . . . . . . . . . . . . . . . . . . . . . 192
B.4 Approximate identification . . . . . . . . . . . . . . . . . . . . . . . . . 192
Bibliography 199
Index 205
Chapter 1
Introduction
The topic of this book is fitting models to data. We would like the model to fit the data
exactly; however, in practice often the best that can be achieved is only an approximate fit.
A fundamental question in approximate modeling is how to quantify the lack of fit between
the data and the model. In this chapter, we explain and illustrate two different approaches
for answering this question.
The first one, called latency, augments the model with additional unobserved variables
that allow the augmented model to fit the data exactly. Many classical approximate modeling
techniques such as the least squares and autoregressive moving average exogenous
(ARMAX) system identification methods are latency oriented methods. The statistical tool
corresponding to the latency approach is regression.
An alternative approach, called misfit, resolves the data-model mismatch by correcting
the data, so that it fits the model exactly. The main example of the misfit approach is the total
least squares method and the corresponding statistical tool is errors-in-variables regression.
1.1 Latency and Misfit
Classically a model is defined as a set of equations involving the data variables, and the lack
of fit between the data and the model is defined as a norm of the equation error, or residual,
obtained when the data is substituted in the equations. Consider, for example, the familiar
linear static model, represented by an overdetermined system of equations AX ≈ B, where
A, B are given measurements, and the classical least squares (LS) method, which minimizes
the Frobenius norm of the residual E := AX − B, i.e.,

    min_{E, X} ||E||_F   subject to   AX = B + E.

The residual E in the LS problem formulation can be viewed as an unobserved, latent
variable that allows us to resolve the data-model mismatch. An approximate model for
the data is obtained by minimizing some norm (e.g., the Frobenius norm) of E. This cost
function is called latency, and equation error based methods are called latency oriented.
A fundamentally different approach is to find the smallest correction on the data that
makes the corrected data compatible with the model (i.e., resulting in a zero equation error).
Then the quantitative measure, called misfit, for the lack of fit between the data and the model
is taken to be a norm of the correction. Applied to the linear static model, represented by
the equation AX ≈ B, the misfit approach leads to the classical total least squares (TLS)
method [GV80, VV91]:

    min_{ΔA, ΔB, X} ||[ΔA  ΔB]||_F   subject to   (A + ΔA)X = B + ΔB.

Here ΔA, ΔB are corrections on the data A, B; and X is a model parameter.
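To make the two formulations concrete, the following sketch (in MATLAB, the language of the software accompanying the book, but not taken from that software) computes both solutions for hypothetical random data; the TLS solution shown is the generic-case SVD solution of [GV80].

    % Hypothetical data: B is a noisy version of A*X0
    m = 100; n = 2; d = 1;                  % A is m x n, B is m x d
    A = randn(m, n); X0 = randn(n, d);
    B = A * X0 + 0.1 * randn(m, d);

    % Latency (LS): correct only B, i.e., minimize ||E||_F subject to A*X = B + E
    X_ls = A \ B;

    % Misfit (TLS): correct both A and B; generic-case solution via the SVD of [A B]
    [~, ~, V] = svd([A B], 0);
    V12 = V(1:n, n+1:end);  V22 = V(n+1:end, n+1:end);
    X_tls = -V12 / V22;

The two parameter estimates X_ls and X_tls generally differ, reflecting the different correction mechanisms.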
The latency approach corrects the model in order to make it match the data. The
misfit approach corrects the data in order to make it match the model. Both approaches
reduce the approximate modeling problem to exact modeling problems.
When the model fits the data exactly, both the misfit and the latency are zero, but when the
model does not fit the data exactly, in general, the misfit and the latency differ.
Optimal approximate modeling aims to minimize some measure of the data-model
mismatch over all models in a given model class. The latency and the misfit are two
candidate measures for approximate modeling. The classical LS and TLS approximation
methods minimize, respectively, the latency and the misfit for a linear static model class,
represented by the equation AX ≈ B. Similarly, the algebraic and geometric methods for
ellipsoid fitting minimize the latency and the misfit for a quadratic static model class. For
the linear time-invariant (LTI) dynamic model class, the latency and the misfit approaches
lead to, respectively, the ARMAX and errors-in-variables (EIV) identification methods.
In the next section we illustrate via examples the misfit and latency approaches for
data fitting by linear static, quadratic static, and LTI dynamic models.
1.2 Data Fitting Examples
Consider a data set D = {d_1, ..., d_N} consisting of 2 real variables, denoted by a and b, i.e.,

    d_i = col(a_i, b_i) ∈ R^2,

and N = 10 data points. This data is visualized in the plane; see Figure 1.1. The order of
the data points is irrelevant for fitting by a static model. For fitting by a dynamic model,
however, the data is viewed as a time series, and therefore the order of the data points is
important.
Line Fitting
First, we consider the problem of fitting the data by a line passing through the origin (0, 0).
This problem is a special case of modeling the data by a linear static model. The classical
LS and TLS methods are linear static approximation methods and are applied next to the
line fitting problem in the example.
Figure 1.1. The data D consists of 2 variables and 10 data points; the point (0, 0) is also marked.
Least Squares Method
If the data points d_1, ..., d_10 were on a line, then they would satisfy a linear equation

    a_i x = b_i,   for i = 1, ..., 10 and for some x ∈ R.

The unknown x is a parameter of the fitting line (which from the modeling point of view is
the linear static model). In the example, the parameter x has a simple geometric meaning: it
is the tangent of the angle between the fitting line and the horizontal axis. Therefore, exact
fitting of a (nonvertical) line through the data boils down to choosing x ∈ R.
However, unless the data points were on a line to begin with, exact fit would not be
possible. For example, when the data is obtained from a complicated phenomenon or is
measured with additive noise, an exact fit is not possible. In practice most probably both
the complexity of the data generating phenomenon and the measurement errors contribute
to the fact that the data is not exact.
The latency approach introduces an equation error e = col(e_1, ..., e_10), so that there
exists a corresponding parameter x ∈ R, satisfying the modified equation

    a_i x = b_i + e_i,   for i = 1, ..., 10.

For any given data set D and a parameter x ∈ R, there is a corresponding e, defined by
the above equation, so that indeed the latency term e allows us to resolve the data-model
discrepancy.
The LS solution

    x_ls := (∑_{i=1}^{10} b_i a_i) / (∑_{i=1}^{10} a_i^2)

minimizes the latency,

    latency := ||e||,

over all x ∈ R. The line corresponding to the parameter x_ls is the optimal fitting line
according to the latency criterion. It is plotted in the left plot of Figure 1.2.
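In MATLAB notation, assuming column vectors a and b holding the 10 data points (a sketch, not the accompanying software), the latency-optimal line is computed as

    x_ls = (a' * b) / (a' * a);    % x_ls = sum(b_i*a_i) / sum(a_i^2)
    e    = a * x_ls - b;           % equation error: a_i*x_ls = b_i + e_i
    latency = norm(e);             % the latency ||e||, minimized by x_ls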
The LS method can also be given an interpretation of correcting the data in order to
make it match the model. The equation error e can be viewed as a correction on the second
Figure 1.2. Optimal fitting lines and data corrections (- - -); left plot: latency approach, right plot: misfit approach.
coordinate b. The first coordinate a, however, is not corrected, so that the LS corrected
data is

    â_ls,i := a_i   and   b̂_ls,i := b_i + e_i,   for i = 1, ..., 10.
By construction the corrected data lies on the line given by the parameter x_ls, i.e.,

    â_ls,i x_ls = b̂_ls,i,   for i = 1, ..., 10.

The LS corrections Δd_ls,i := col(0, e_i) are vertical lines in the data space (see the dashed
lines in Figure 1.2, left).
Geometrically, the latency is the sum of the squared vertical distances from the
data points to the fitting line.
Total Least Squares Method
The misfit approach corrects both coordinates a and b in order to make the corrected data
exact. It seeks corrections Δd_1, ..., Δd_10, such that the corrected data

    d̂_i := d_i + Δd_i

lies on a line; i.e., with col(â_i, b̂_i) := d̂_i, there is an x ∈ R, such that

    â_i x = b̂_i,   for i = 1, ..., 10.
For a given parameter x ∈ R, let ΔD = {Δd_1, ..., Δd_10} be the smallest in the Frobenius
norm correction of the data that achieves an exact fit. The misfit between the line
corresponding to x and the data is defined as

    misfit := ||[Δd_1  ···  Δd_10]||_F.
Geometrically, the misfit is the sum of the squared orthogonal distances from the
data points to the fitting line.

The optimal fitting line according to the misfit criterion and the corresponding data corrections
are shown in the right plot of Figure 1.2.
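A minimal sketch of the misfit-optimal line in MATLAB (assuming the same vectors a and b): the corrected data is the best rank-1 approximation of the data matrix in the Frobenius norm, obtained from its SVD.

    D = [a'; b'];                     % 2 x 10 data matrix, columns are the data points d_i
    [U, S, V] = svd(D);
    Dh = S(1,1) * U(:,1) * V(:,1)';   % closest rank-1 matrix to D (corrected data)
    misfit = norm(D - Dh, 'fro');     % misfit of the corresponding line
    x_tls  = U(2,1) / U(1,1);         % slope of the fitting line (assuming U(1,1) is nonzero)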
Ellipsoid Fitting
Next, we consider fitting an ellipse to the data. This problem is a special case of modeling
the data by a quadratic static model. We show the latency and misfit optimal fitting ellipses.
The misfit has the geometric interpretation of finding the orthogonal projections of the data
points on the ellipse. The latency, however, has no meaningful geometric interpretation in
the ellipsoid fitting case.
Algebraic Fitting Method
If the data points d_1, ..., d_10 were on an ellipse, then they would satisfy a quadratic equation

    d_i^⊤ A d_i + β^⊤ d_i + c = 0,   for i = 1, ..., 10 and
    for some A ∈ R^{2×2}, A = A^⊤, A > 0, β ∈ R^2, c ∈ R.

The symmetric matrix A, the vector β, and the scalar c are parameters of the ellipse (which
from the modeling point of view is the quadratic static model). As in the line fitting example,
generically the data does not lie on an ellipse.
The latency approach leads to what is called the algebraic fitting method. It looks for
equation errors e_1, ..., e_10 and parameters Â ∈ R^{2×2}, β̂ ∈ R^2, ĉ ∈ R, such that

    d_i^⊤ Â d_i + β̂^⊤ d_i + ĉ = e_i,   for i = 1, ..., 10.

Clearly, for any Â ∈ R^{2×2}, β̂ ∈ R^2, ĉ ∈ R, i.e., for any chosen second order surface (in
particular an ellipse), there is a corresponding equation error e := col(e_1, ..., e_10) defined
by the above equation. Therefore, the latency term e again allows us to resolve the data-model
discrepancy. The 2-norm of e is by definition the latency of the surface corresponding
to the parameters Â, β̂, ĉ and the data. The left plot of Figure 1.3 shows the latency optimal
ellipse for the data in the example.
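For illustration, the latency of a candidate second order surface can be evaluated directly from its definition; the following MATLAB sketch uses hypothetical parameter values A, beta, c and hypothetical data.

    A = [1 0; 0 2]; beta = [-4; -8]; c = 10;    % hypothetical surface parameters
    D = randn(2, 10);                           % hypothetical data, columns are d_i
    e = sum(D .* (A * D), 1)' + D' * beta + c;  % e_i = d_i'*A*d_i + beta'*d_i + c
    latency = norm(e);                          % 2-norm of the equation error vector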
Geometric Fitting Method
The misfit approach leads to what is called the geometric fitting method. In this case, the
aim is to find the minimal corrections in a Frobenius norm sense Δd_1, ..., Δd_10, such that
the corrected data d̂_1, ..., d̂_10 lies on a second order surface; i.e., there exist Â ∈ R^{2×2},
β̂ ∈ R^2, ĉ ∈ R, for which

    d̂_i^⊤ Â d̂_i + β̂^⊤ d̂_i + ĉ = 0,   for i = 1, ..., 10.

For a given ellipse, the Frobenius norm of the smallest data corrections that make the data
exact for that ellipse is by definition the misfit between the ellipse and the data. The norm
of the correction Δd_i is the orthogonal distance from the data point d_i to the ellipse. The
misfit optimal ellipse is shown in the right plot of Figure 1.3.
Figure 1.3. Optimal fitting ellipses and data corrections (- - -) for the misfit approach; the centers of the ellipses are also marked. Left plot: latency approach, right plot: misfit approach.
Linear Time-Invariant System Identification
Next, we consider fitting the data by a dynamic model. In this case the data D is viewed
as a vector time series. Figure 1.4 shows the data in the plane (as in the static case) but
with numbers indicating the data point index, viewed now as a time index. The dynamics
is expressed in a motion (see the arrow lines in the figure) starting from data point 1, going
to data point 2, then to data point 3 (for the same period of time), and so on, until the last
data point 10.
The considered model class consists of LTI systems with one input and one time lag.
Figure 1.4. The data D viewed as a time series. The numbers show the data point index, or, equivalently, the time index. The arrow lines show the dynamics of the model: motion through the consecutive data points.
Figure 1.5. Signal processor interpretation of an LTI system (input a, output b).
Models of this type admit a difference equation representation

    R_0 d_i + R_1 d_{i+1} = 0,   where R_0, R_1 ∈ R^{1×2}.

The vectors R_0 and R_1 are parameters of the model.
Let R_i =: [Q_i  −P_i], i = 0, 1, and suppose that P_1 ≠ 0. Then the variable a acts
as an input (free variable) and the variable b acts as an output (bound variable). This gives
an input/output separation of the variables

    Q_0 a_i + Q_1 a_{i+1} = P_0 b_i + P_1 b_{i+1}

and corresponds to the classical notion of a dynamical system as a signal processor, accepting
inputs and producing outputs; see Figure 1.5.
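A minimal MATLAB sketch of this input/output point of view (hypothetical parameter values): given the input a and an initial condition for b, the output is obtained recursively from the difference equation.

    Q0 = 1; Q1 = -0.5; P0 = -0.8; P1 = 1;   % hypothetical parameters, P1 nonzero
    a = randn(10, 1);  b = zeros(10, 1);    % input trajectory and output storage
    b(1) = 0;                               % initial condition
    for i = 1:9                             % Q0*a(i) + Q1*a(i+1) = P0*b(i) + P1*b(i+1)
        b(i+1) = (Q0*a(i) + Q1*a(i+1) - P0*b(i)) / P1;
    end
    d = [a b];                              % trajectory: row i is d_i' = [a_i b_i]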
Autoregressive Moving Average Exogenous and Output Error Identification
If the data D were an exact trajectory of an LTI model in the considered model class, then
there would exist vectors R_0, R_1 ∈ R^{1×2} (parameters of the model) and d_11 ∈ R^2 (initial
condition), such that

    R_0 d_i + R_1 d_{i+1} = 0,   for i = 1, ..., 10.
However, generically this is not the case, so that an approximation is needed. The latency
approach modifies the model equation by adding an equation error e:

    R̂_0 d_i + R̂_1 d_{i+1} = e_i,   for i = 1, ..., 10.
The residual e can be considered to be an unobserved (latent) variable; see Figure 1.6.
From this point of view it is natural to further modify the system equation by allowing
for a time lag in the latent variable (as in the other variables)

    Q_0 a_i + Q_1 a_{i+1} − P_0 b_i − P_1 b_{i+1} = M_0 e_i + M_1 e_{i+1}.   (∗)

The real numbers M_0 and M_1 are additional parameters of the model.
An interesting special case of the latent variable equation (∗), called the output error
identification model, is obtained when M_0 = P_0 and M_1 = P_1. Then the latent variable e
Figure 1.6. LTI system with a latent variable e.
Figure 1.7. Data D, optimal fitting trajectory D̂_oe (- - -), and data corrections; left plot: latency (output error) approach, right plot: misfit approach.
acts like a correction on the output. The input, however, is not corrected, so that the corrected
data by the output error model is

    â_oe,i := a_i   and   b̂_oe,i := b_i + e_i,   for i = 1, ..., 10.

By construction the corrected time series d̂_oe := col(â_oe, b̂_oe) satisfies the equation

    Q_0 â_oe,i + Q_1 â_oe,i+1 = P_0 b̂_oe,i + P_1 b̂_oe,i+1.
The optimal output error fitting data D̂_oe := {d̂_oe,1, ..., d̂_oe,10} over the parameters P_i,
Q_i (i.e., over all models with one input and one time lag) is visualized in the left plot of
Figure 1.7.
Note the similarity between the output error identification method and the classical
LS method. Indeed,
output error identification can be viewed as a dynamic LS method.
Errors-in-Variables Identification
The misfit approach leads to what is called the global total least squares method. It is a
generalization of the TLS method for approximate modeling by an LTI dynamic model. In
this case the given time series is modified by the smallest corrections Δd_1, ..., Δd_10, in a
Frobenius norm sense, such that the corrected time series

    d̂_i := d_i + Δd_i,   i = 1, ..., 10

is a trajectory of a model in the model class. Therefore, there are parameters of the model
R̂_0, R̂_1 ∈ R^{1×2} and an initial condition d̂_11 ∈ R^2, such that

    R̂_0 d̂_i + R̂_1 d̂_{i+1} = 0,   for i = 1, ..., 10.

The right plot of Figure 1.7 shows the misfit optimal fitting data D̂.
1.3 Classical vs. Behavioral and Stochastic vs. Deterministic Modeling
In what sense can the examples of Section 1.2 be viewed as data modeling? In other words,
what are the models in these examples? In the line fitting case, clearly the model is a line.
The data is a collection of points in R^2 and the model is a subset of the same space. In the
ellipsoid fitting case, the model is an ellipse, which is again a subset of the data space R^2.
A line and an ellipse are static models in the sense that they describe the data points without
relations among them. In particular, their order is not important for static modeling.
In the system identification examples, the data set D is viewed as an entity: a finite
vector time series. A dynamical model is again a subset, however, consisting of time series.
The geometric interpretation of the dynamic models is more subtle than the one of the static
models due to the time series structure of the data space. In the static examples of Section 1.2
the data space is 2-dimensional while in the dynamic examples it is 20-dimensional.
The point of view of the model as a subset of the data space is inspired by the
behavioral approach to system theory.
This point of view has a number of important advantages over the classical point of view of
a model as a set of equations. In the behavioral approach an equation is a representation of
its solution set (which is the model itself). A model has infinitely many representations, so
that a particular representation is not an intrinsic characteristic of the model.
Consider, for example, a linear static model B that is a one-dimensional subspace
of R^2. Perhaps the most commonly used way to define B is via the representation

    B = { d := col(a, b) | ax = b }.

However, the same model can be represented as the kernel of a 1 × 2 matrix R, i.e.,

    B = ker(R) := { d | Rd = 0 },

or as the image of a 2 × 1 matrix P, i.e.,

    B = col span(P) := { d | there is l, such that d = Pl }.

Moreover, the parameters R and P of a kernel and an image representation are not unique.
Which particular representation one is going to choose is a matter of convenience. Therefore,
an approximate modeling problem formulation in terms of a particular representation is
unnecessarily restrictive. Note that the representation ax = b does not exist for all one-dimensional
subspaces of R^2. (Consider the vertical line col span(col(0, 1)).)
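The point is easy to illustrate numerically. In the following MATLAB sketch (hypothetical parameter values), the same line is written in input/output, kernel, and image form; neither R nor P is unique, since any nonzero scaling of them represents the same model.

    x = 2;          % input/output representation: { col(a, b) | a*x = b }
    R = [x -1];     % kernel representation:       B = ker(R) = { d | R*d = 0 }
    P = [1; x];     % image representation:        B = col span(P) = { P*l }
    d = P * 5;      % a point of B (take l = 5) ...
    R * d           % ... which R indeed annihilates: returns 0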
Another feature in which the presentation in this book differs from most of the ex-
isting literature on approximate modeling is the use of deterministic instead of stochastic
assumptions and techniques. It is well known that the classical LS method has deterministic
as well as stochastic interpretations. The same duality exists (and is very much part of the
literature) for other modeling methods. For example, the TLS method, introduced by Golub
and Van Loan [GV80] in the numerical linear algebra literature as a tool for approximate
solution of an overdetermined linear system of equations, can be viewed as a consistent
estimator in the linear EIV model, under suitable statistical assumptions.
One and the same modeling method can be derived and justified in deterministic
as well as stochastic setting.
Both approaches are useful and contribute to a deeper understanding of the methods. In
our opinion, however, the stochastic paradigm is overused and sometimes misused. Often
the conceptual simplicity of the deterministic approach is an important advantage (certainly
so from the pedagogical point of view). Unlike the stochastic approach, the deterministic
one makes no unverifiable assumptions about the data generating phenomenon. As a con-
sequence, however, fewer properties can be proven in the deterministic setting than in the
stochastic one.
Most of the problems in the book are posed in the behavioral setting and use the
misfit approach. This new paradigm and related theory are still under development and are
currently far less mature than the classical stochastic latency oriented approach. Our aim is
to popularize and stimulate interest in the presented alternative approaches for approximate
modeling.
1.4 Chapter-by-Chapter Overview∗
The introduction in Sections 1.1–1.3 is informal. Chapter 2 gives an in-depth introduction to
the particular problems considered in the book. The main themes, exact and misfit optimal
approximate modeling, are introduced in Sections 2.1 and 2.2. Then we elaborate on
the model representation issue. An important observation is that the misfit optimal model
is independent of the particular representation chosen, but the latency optimal model in
general depends on the type of representation. In Sections 2.4–2.6 we specify the misfit
approximation problem for the linear static, bilinear and quadratic static, and LTI dynamic
model classes. An approximate modeling problem, called structured total least squares
(STLS), which can treat various static and dynamic linear misfit approximation problems,
is introduced in Section 2.7. Chapter 2 ends with an overview of the adopted solution
methods.
The book is divided into two parts:
Part I deals with static models and
Part II deals with dynamic models.
Optional sections (like this section) are marked with ∗. The material in the optional sections
is more technical and is not essential for the understanding of what follows.
Chapter 3: Weighted Total Least Squares The weighted total least squares (WTLS)
problem is a misfit based approximate modeling problem for linear static models. The WTLS
misfit is defined as a weighted projection of the data D on a model B. The choice of the
weight matrices for the projection is discussed in Section 3.1, where two possibilities are
described. The first one leads to a problem, called relative error total least squares, and the
second one leads to the problem of maximum likelihood estimation in the EIV model.
The kernel, image, and input/output representations of a linear static model are presented
in Section 3.2. We believe that these representations and the links among them are
prerequisites for the proper understanding of all static approximation problems.
In Section 3.3, we solve the TLS and the generalized TLS problems, which are special
cases of the WTLS problem. They are treated separately because a closed form solution in
terms of the singular value decomposition (SVD) exists. The ingredients for the solution
are
1. the equivalence between data consistent with a linear static model and a low-rank
matrix, and
2. the Eckart–Young–Mirsky low-rank approximation lemma, which shows how an optimal
(in the sense of the Frobenius norm) low-rank approximation of a given matrix
can be computed via SVD.
The solution of the TLS problem is given in terms of the SVD of the data matrix and the
solution of the GTLS problem is given in a similar way in terms of the SVD of a modified
data matrix.
The WTLS problem is a double minimization problem. In Section 3.4, we solve
in closed form the inner minimization, which is the misfit computation subproblem. The
results are given in terms of kernel and image representations, which lead to, respectively,
least norm and least squares problems.
In the optional Section 3.5, we consider the remaining subproblem: minimization
with respect to the model parameters. It is a nonconvex optimization problem that in
general has no closed form solution. For this reason, numerical optimization methods are
employed. We present three heuristic algorithms: alternating least squares, an algorithm
due to Premoli and Rastello, and an algorithm based on standard local optimization methods.
Chapter 4: Structured Total Least Squares The STLS problem is a flexible tool that
covers various misfit minimization problems for linear models. We review its origin and
development in Section 4.1. There are numerous (equivalent) formulations that differ in the
representation of the model and the optimization algorithm used for the numerical solution of
the problem. The proposed methods, however, have high computational complexity and/or
assume a special type of structure that limit their applicability in real-life applications. Our
motivation is to overcome as much as possible these limitations and propose a practically
useful solution.
In Section 4.2, we define the considered STLS problem. The data matrix is partitioned
into blocks and each of the blocks is block-Toeplitz/Hankel structured, unstructured,
or exact. As shown in Section 4.6, this formulation is general enough to cover many structured
approximation problems and at the same time allows efficient solution methods. Our
solution approach is based on the derivation of a closed form expression for an equivalent
unconstrained problem, in which a large number of decision variables are eliminated. This
step corresponds to the misfit computation in the misfit approximation problems.
The remaining problem is a nonlinear least squares problem and is solved numerically
via local optimization methods. The cost function and its first derivative evaluation, however,
are performed efficiently by exploiting the structure in the problem. In the optional
Section 4.3, we prove that as a consequence of the structure in the data matrix, the equivalent
optimization problem has block-Toeplitz and block-banded structure. In Section 4.4,
a stochastic interpretation of the Toeplitz and banded structure of the equivalent problem is
given.
A numerical algorithm for solving the STLS problem is described in Section 4.5.
It is implemented in the software package described in Appendix B.2. In Section 4.6, we
show simulation examples that demonstrate the performance of the proposed STLS solution
method on standard approximation problems. The performance of the STLS package is
compared with that of alternative methods on LS, TLS, mixed LS-TLS, Hankel low-rank
approximation, deconvolution, and system identification problems.
Chapter 5: Bilinear Errors-in-Variables Model In Chapter 5, we consider approximations
by a bilinear model. The presentation is motivated from the statistical point of view
of deriving a consistent estimator for the parameters of the true model in the EIV setup. The
misfit approach yields an inconsistent estimator in this case, so that an alternative approach
based on the adjustment of the LS approximation is adapted.
An adjusted least squares (ALS) estimator, which is in principle a latency oriented
method, is derived in Section 5.2, and its statistical properties are stated in the optional Section
5.3. Under suitable conditions, it is strongly consistent and asymptotically normal. In
Section 5.4, we show simulation examples illustrating the consistency of the ALS estimator.
In Section 5.5, we consider a different approximation problem by a static bilinear
model. It is motivated from an application in computer vision, called fundamental matrix
estimation. The approach is closely related to the one of Section 5.2.
Chapter 6: Ellipsoid Fitting The ALS approach of Chapter 5 is further applied for
approximation by a quadratic model. The motivation for considering the quadratic model is
the ellipsoid fitting problem. In Section 6.1, we introduce the ellipsoid fitting problem and
review the literature. As in Chapter 5, we consider the EIV model and note that the misfit
approach, although intuitively attractive and geometrically meaningful, yields a statistically
inconsistent estimator. This motivates the application of the ALS approach.
In Section 6.2, we define the quadratic EIV model. The LS and the ALS estimators
are presented, respectively, in Sections 6.3 and 6.4. The ALS estimator is derived from the
LS estimator by properly adjusting its cost function. Under suitable conditions the ALS
estimator yields a consistent estimate of the parameters of the true model. In the optional
Section 6.6, we present an algorithm for the computation of the ALS estimator. Simulation
examples comparing the ALS and alternative estimators on benchmark problems from the
literature are shown in Section 6.7.
Chapter 7: Introduction to Dynamical Models Chapter 7 is an introduction to Part II
of the book. The main emphasis is on the representation of an LTI system. Different
representations are suitable for different problems, so that familiarity with a large number
of alternative representations is instrumental for solving the problems. First, we give a high
level characterization of an LTI system: its behavior is linear, shift-invariant, and closed
in the topology of pointwise convergence. Then we consider a kernel representation of an
LTI system, i.e., difference equation representation. However, we use polynomial matrix
notation. A sequence of equivalence operations on the difference equations is represented by
premultiplication of a polynomial operator by a unimodular matrix. Also, certain properties
of the representation such as minimality of the number of equations is translated to equivalent
properties of polynomial matrices. Special forms of the polynomial matrix display important
invariants of the system such as the number of inputs and the minimal state dimension.
We discuss the question of what inputs and outputs of the system are and show repre-
sentations that display the input/output structure. The classical input/state/output represen-
tation of an LTI system is obtained by introducing, in addition, latent variables with special
properties. The controllability property of a system is introduced and a test for it is shown in
terms of a kernel representation. Any system allows a decomposition into an autonomous
subsystem and a controllable subsystem. A controllable system can be represented by a
transfer function or a convolution operator or as the image of a polynomial operator. Finally,
the latent variable and driving input state space representation are presented.
The introduction of the various system representations is summarized by a represen-
tation theorem that states their equivalence. The chapter continues with the related question
of parameterizing a trajectory of the system. The most convenient representation for this
purpose is the input/state/output representation that displays explicitly both the input and
the initial conditions.
Chapter 8: Exact Identification The simplest and most basic system identification
problem is considered first: given a trajectory of an LTI system, find a representation of
that system. The data is an exact trajectory and the system has to be recovered exactly. The
problem can be viewed as a representation question: pass from a sufficiently informative
trajectory to a desirable representation of the system.
We answer the question of when a trajectory is sufficiently informative in order to
allow exact identification. This key result is repeatedly used and is called the fundamental
lemma.
The exact identification problem is closely related to the construction of what is called
the most powerful unfalsified model (MPUM). Under the condition of the fundamental
lemma, the MPUM is equal to the data generating system, so that one can look for algorithms
that obtain specific representations of that system from the data. We review algorithms for
passing from a trajectory to kernel, convolution, and input/state/output representations.
Relationships to classical deterministic subspace identification algorithms are given.
Our results show alternative system theoretic derivations of the classical subspace
identification methods. In particular, the orthogonal and oblique projections from the
MOESP and N4SID subspace identification methods are interpreted. It is shown that the
orthogonal projection computes free responses and the oblique projection computes sequential
free responses, i.e., free responses of which the initial conditions form a state sequence.
From this perspective, we answer the long-standing question in subspace identification of
how to partition the data into "past" and "future". The past is used to set the initial
condition for a response computed in the future.
The system theoretic interpretation of the orthogonal and oblique projections reveals
their inefficiency for the purpose of exact identification. We present alternative algorithms
that correct this deficiency and show simulation results that illustrate the performance of
various algorithms for exact identification.
Chapter 9: Balanced Model Identification Balancing is often used as a tool for model
reduction. In Chapter 9, we consider algorithms for obtaining a balanced representation of
the MPUM directly from data. This is a special exact identification problem.
Two algorithms were previously proposed in the setting of the deterministic subspace
identification methods. We analyze their similarity and differences and show that they fall
under the same basic outline, where the impulse response and sequential zero input responses
are obtained from data. We propose alternative algorithms that need weaker assumptions
on the available data. In addition, the proposed algorithms are computationally more efficient
since the block-Hankel structure of certain matrices appearing in the computations is
explicitly taken into account.
Chapter 10: Errors-in-Variables Smoothing and Filtering The approximate system
identification problem, based on the misfit approach, has as a subproblem the computation
of the closest trajectory in the behavior of a given model to a given time series. This is a
smoothing problem whose solution is available in closed form. However, efficient recursive
algorithms are of interest. Moreover, the filtering problem, in which the approximation is
performed in real time, is of independent interest.
Deterministic smoothing and filtering in the behavioral setting are closely related
to smoothing and filtering in the EIV setting. We solve the latter problems for systems
given in an input/state/output representation. The optimal filter is shown to be equivalent
to the classical Kalman filter derived for a related stochastic system. The result shows
that smoothing and filtering in the EIV setting are not fundamentally different from the
classical smoothing and Kalman filtering for systems driven by white noise input and with
measurement noise on the output.
Chapter 11: Approximate System Identification The approximate identification
problem, treated in Chapter 11, is the global total least squares (GlTLS) problem, i.e., the
misfit minimization problem for an LTI model class of bounded complexity. This problem
is a natural generalization of the exact identification problem of Chapter 8 for the case when
the MPUM does not exist.
Because of the close connection with the STLS problem and because in Part I of the
book numerical solution methods are developed for the STLS problem, our goal in this
chapter is to link the GlTLS problem to the STLS problem. This is done in Section 11.2,
where conditions under which the equivalence holds are given. The most restrictive of these
conditions is the condition on the order of the identified system: it should be a multiple of
the number of outputs. Another condition is that the optimal approximation allows a fixed
input/output partition, which is conjectured to hold generically.
In Section 11.3, we discuss several extensions of the GlTLS problem: treating exact
and latent variables and using multiple time series for the approximation. In Section 11.4,
the problem is specialized to what is called the approximate realization problem, where the
given data is considered to be a perturbed version of an impulse response, the related problem
of autonomous system identification, and the problem of finite time ℓ_2 model reduction.
In Section 11.5, we present simulation examples with data sets from the data base
for system identification DAISY. The results show that the proposed solution method is
effective and efficient for a variety of identification problems.
Chapter 2
Approximate Modeling
via Misfit Minimization
This chapter gives a more in-depth introduction to the problems considered in the book:
data fitting by linear, bilinear, and quadratic static as well as linear time-invariant dynamic
models. In the linear case, the discrepancy between the data and the approximate model is
measured by the misfit. In the nonlinear case, the approximation is defined as a quadratically
constrained least squares problem, called adjusted least squares.
The main notions are data, model, and misfit. Optimal exact modeling aims to fit the
data and as little else as possible by a model in a given model class. The model obtained
is called the most powerful unfalsified model (MPUM). The MPUM may not exist in a
specified model class. In this case we accept a falsified model that fits optimally the data
according to the misfit approximation criterion. The total least squares (TLS) problem
and its variations, generalized total least squares (GTLS) and weighted total least squares
(WTLS), are special cases of the general misfit minimization problem for the linear static
model. In the dynamic case, the misfit minimization problem is called the global total least
squares (GlTLS) problem.
An overview of the solution methods that are used is given. The misfit minimization
problem has a quadratic cost function and a bilinear equality constraint. This is a nonconvex
optimization problem, for whose solution we employ local optimization methods. The
bilinear structure of the constraint, however, allows us to solve the optimization problem
partially. This turns the constrained optimization problem into an equivalent nonlinear
least squares problem. The adjusted least squares method, on the other hand, leads to a
generalized eigenvalue problem.
2.1 Data, Model, Model Class, and Exact Modeling
Consider a phenomenon to be described by a mathematical model. Certain variables, related
to the phenomenon, are observable, and the observed data from one or more experiments is
recorded. Using prior knowledge about the phenomenon, a model class of candidate models
is selected. Then the model is chosen from the model class that in a certain specified sense
most adequately describes the available data.
We now formalize this modeling procedure. Call a data point recorded from an experiment
an outcome and let U be the universum of possible outcomes from an experiment.
The observed data D is collected from experiments, so that it is a subset D ⊆ U of the
universum.
Following the behavioral approach to system theory [PW98],
we define a model B to be a set of outcomes, i.e., B ⊆ U.
Actually, for the purpose of modeling, this definition is a bit restrictive. Often the outcomes
are functions of the to-be-modeled variables, i.e., the variables that we aim to describe by the
model. By postulating the model to be a subset of the universum of outcomes, we implicitly
assume that the observed variables are the to-be-modeled variables.
If for a particular experiment an observed outcome d ∈ U is such that d ∈ B, then
we say that B explains d or B is unfalsified by d. In this case the model fits the data
exactly. If d ∉ B, we say that the outcome d falsifies B. In this case the model may fit the
data only approximately.
Let B_1 and B_2 be two models such that B_1 ⊆ B_2. We say that B_1 is simpler (less
complex) than B_2. Simpler means allowing fewer outcomes. If U is a vector space
and we consider models that are (finite dimensional) subspaces, simpler means a lower
dimensional subspace. Note that our notion of simplicity does not refer to a simplicity of a
representation of B.
Simpler models are to be preferred over more complicated ones. Consider the two
statements d ∈ B_1 and d ∈ B_2 with B_1 ⊆ B_2. The first one is stronger and therefore
more useful than the second one. In this sense, B_1 is a more powerful model than B_2.
On the other hand, the a priori probability that a given outcome d ∈ U falsifies the
model B_1 is higher than it is for the model B_2. This shows a trade-off in choosing an
exact model. The extreme cases are the model U that explains every outcome but says
nothing about an outcome and the model {d} that explains only one outcome but completely
describes the outcome.
Next, we introduce the notion of a model class. The set of all subsets of U is denoted
by 2^U. In our setting, 2^U is the set of all models. A model class M ⊆ 2^U is a set of
candidate models for a solution of the modeling problem. In theory, an arbitrary model
class can be chosen. In practice, however, the choice of the model class is crucial in order
to be able to obtain a meaningful solution. The choice of the model class is dictated by
the prior knowledge about the modeled phenomenon and by the difficulty of solving the
resulting approximation problem. We aim at general model classes that still lead to tractable
problems.
The most reasonable exact modeling problem is to find the model B_mpum ∈ M
that explains the data D and as little else as possible. The model B_mpum is called
the most powerful unfalsified model (MPUM) for the data D in the model class M.
The MPUM need not exist, but if it exists, it is unique.
Suppose that the data D is actually generated by a model B̄ ∈ M; i.e., d ∈ B̄ for all
d ∈ D. A fundamental question that we address is, Under what conditions can the unknown
model B̄ be recovered exactly from the data? Without any other a priori knowledge (apart
from the given data D and model class M), this question is equivalent to the question,
Under what conditions does B_mpum = B̄?
2.2 Misfit and Approximate Modeling
The MPUM may not exist for a given data and model class. In fact, for rough data, e.g.,
data collected from a real-life experiment, if the MPUM exists, it tends to be B_mpum = U.
Therefore, the exact modeling problem has either no solution or a trivial one. Although the
concept of the MPUM is an important theoretical tool, the computation of the MPUM is not
a practical modeling algorithm. What enables the modeling procedure to work with rough
data is approximation.
In an approximate modeling problem, the model is required to explain the data only
approximately; i.e., it could be falsified by the data. Next, we define an approximation
criterion called misfit. The misfit between an outcome d ∈ U and a model B ⊆ U is
a measure for the distance from the point d to the set B. As usual, this is defined as the
distance from d to the point d̂* in B that is closest to d. (The hat notation, as in d̂, means
"an approximation of".) For example, if U is an inner product space and B is a closed
subspace, then d̂* is the projection of d on B.
Underlying the definition of the misfit is a distance on U. Let U be a normed vector
space with a norm ||·||_U and define the distance (induced by the norm ||·||_U) between two
outcomes d, d̂ ∈ U as ||d − d̂||_U.

The misfit between an outcome d and a model B (with respect to the norm ||·||_U)
is defined as

    M(d, B) := inf_{d̂ ∈ B} ||d − d̂||_U.

It measures the extent to which the model B fails to explain the outcome d.

A global minimum point d̂* is the best (according to the distance measure ||d − d̂||_U)
approximation of d in B. Alternatively, M(d, B) is the minimal distance between d and
an approximation d̂ compatible with the model B.
For data consisting of multiple outcomes D = {d_1, ..., d_N}, we choose N norms
||·||_i in U and define M_i(d_i, B) to be the misfit with respect to the norm ||·||_i. Then the
misfit between the data D and the model B is defined as

    M({d_1, ..., d_N}, B) := || col( M_1(d_1, B), ..., M_N(d_N, B) ) ||.   (M)
In the context of exact modeling, there is a fundamental trade-off between the power
and complexity of the model. A similar issue occurs in approximate modeling: an arbitrary
small mist can be achieved by selecting a complicated model. The trade-off nowis between
the worst achievable mist and the complexity of the model. The issue can be resolved,
for example, by xing a maximal allowed complexity. With a constraint on the complexity
(incorporated in the denition of the model class), the aim is to minimize the mist.
For a chosen mist M and model class M, the mist minimization problem aims
to nd a model

B in the model class that is least falsied by the data, i.e.,

B := arg min
BM
M(D, B). (APR) i
i
i
i
18 Chapter 2. Approximate Modeling via Mist Minimization
The approximation problem (APR) can be interpreted in terms of the MPUM as follows:
Modify the data as little as possible, so that the MPUM

Bfor the modied data

D
is in a specied model class M.
Next, we describe the important issue of a representation of a model and specify mist
minimization problems for particular model classes in terms of particular representations.
2.3 Model Representation and Parameterization

The definition of the model as a set of outcomes is general and powerful. It allows us to consider linear and nonlinear, static and dynamic, stationary and nonstationary models in the same conceptual setting. For analysis, however, it is too abstract. It is often more convenient to work with particular representations of the model in terms of equations that capture the essential properties of the model.

For a given model B ⊆ U, an equation f(d) = 0 with solution set equal to B, i.e.,

    B = { d ∈ U | f(d) = 0 },        (REPR)

is called a representation of B.

The function f : U → R^g that describes the model B is defined in terms of parameters. Consider, for example, a real vector space U = R^{n_θ} and a linear function f_θ(d) = θ^⊤ d. The vector θ ∈ R^{n_θ} parameterizes f_θ and, via (REPR), also B.

Let f_θ(d) = 0 be a representation with a parameter vector θ ∈ R^{n_θ}. Different values of θ result in different models B(θ). We can view the representation by f_θ as a mapping B : R^{n_θ} → 2^U from the parameter space to the set of models. A given set of parameters Θ ⊆ R^{n_θ} corresponds to the set of models B(Θ) ⊆ 2^U, i.e., to a model class. Assume that for a given representation f_θ and a given model class M, there is a corresponding parameter set Θ ⊆ R^{n_θ}, such that M = B(Θ).

In terms of the representation f_θ, the misfit minimization problem (APR) becomes the following parameter optimization problem:

    θ̂ := arg min_{θ ∈ Θ} M( D, B(θ) ).        (APR_θ)

The numerical implementation of the algorithms depends on the particular representation chosen. From the point of view of the abstract formulation (APR), however, the representation issue is not essential. This is in contrast with approximation methods that minimize an equation error criterion.

Consider a model B ⊆ U with representation (REPR). An outcome d ∈ U that is not consistent with the model B may not satisfy the equation, yielding e(θ) := f_θ(d), called equation error. The equation error for a given d is a function e : R^{n_θ} → R^g of the parameter θ and therefore it depends on the model B(θ). Since

    f_θ(d) = e(θ) = 0  ⟺  d ∈ B(θ),

we define the equation misfit (lack of fit in terms of equations representing the model)

    M_eqn(d, θ) := ‖ f_θ(d) ‖_eqn,

where ‖·‖_eqn is a norm defined in R^g. The equation misfit depends on the representation. In contrast, the behavioral misfit M is representation independent.

Note 2.1 (Latency) The equation error e can be viewed as an unobserved, latent variable. From this alternative point of view the equation misfit M_eqn is the latency of Chapter 1.

As before, for multiple observed outcomes D = {d_1, …, d_N}, we define the equation misfits M_eqn,i(d_i, θ) in terms of the norms ‖·‖_i in R^g, and

    M_eqn( {d_1, …, d_N}, θ ) := ‖ col( M_eqn,1(d_1, θ), …, M_eqn,N(d_N, θ) ) ‖.        (Meqn)

Given a model class M, represented in the parameter space by the parameter set Θ, an approximation problem that minimizes the equation misfit is

    θ̂_eqn := arg min_{θ ∈ Θ} M_eqn(D, θ).        (APReqn)

Solving (APReqn) is often easier than solving (APR), but the main disadvantage is that the obtained approximation is representation dependent.
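The distinction between the two criteria can be made concrete for a linear model with a kernel representation Rd = 0. The following MATLAB-style sketch (hypothetical data; an illustration only) evaluates both criteria for the same model: the equation misfit is simply the size of the residual Rd, while the behavioral misfit is the distance from d to the set ker(R).

    % Minimal sketch (hypothetical data): behavioral misfit versus equation misfit
    % for a linear static model with kernel representation B = ker(R).
    R = [1 -1 2];                            % hypothetical kernel parameter, g = 1 law, d = 3 variables
    d = [1; 2; 3];                           % hypothetical outcome, not in B
    M_eqn = norm(R * d);                     % equation misfit: size of the residual R*d
    d_hat = d - R' * ((R * R') \ (R * d));   % projection of d onto ker(R)
    M_beh = norm(d - d_hat);                 % behavioral misfit: distance from d to ker(R)
    % M_beh equals M_eqn only when the rows of R are orthonormal (R*R' = I).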
2.4 Linear Static Models and Total Least Squares

In the rest of this chapter we consider real valued data. For static problems, the universum set U is defined to be R^d. The available data D consists of N outcomes d_1, …, d_N ∈ R^d. We define the data matrix D := [d_1 ⋯ d_N] ∈ R^{d×N} and the shorthand notation

    [d_1 ⋯ d_N] ∈ B ⊆ U  :⟺  d_i ∈ B, for i = 1, …, N.

A linear static model B is a linear subspace of U = R^d. Let m := dim(B) be the dimension of the model B and let L^d_{m,0} be the set of all linear static models with d variables of dimension at most m. (The 0 in the notation L^d_{m,0} indicates that the models in this model class are static.) The complexity of the model B is related to its dimension m: the model is simpler, and therefore more powerful, when it has smaller dimension.

The model B imposes linear laws r_i^⊤ d = 0, r_i ∈ R^d, on the outcomes. If B is defined by g linear laws r_1, …, r_g, then d ∈ B if and only if Rd = 0, where R := [r_1 ⋯ r_g]^⊤. Therefore, B = ker(R). The representation of B := ker(R) by the equation Rd = 0 is called a kernel representation of B. Any linear model B admits a kernel representation with a parameter R of full row rank.

The MPUM for the data D in the model class L^d_{m,0} exists if and only if rank(D) ≤ m. If the MPUM exists, it is unique and is given by B_mpum = col span(D). For rough data and with N > m, typically rank(D) = d, so that the MPUM either does not exist or is the trivial model B_mpum = R^d. In such cases an approximation is needed.

The misfit minimization problem (APR) with model class M = L^d_{m,0} and 2-norms ‖·‖_i,

    B̂_tls = arg min_{B ∈ L^d_{m,0}} ( min_{D̂ ∈ B} ‖D − D̂‖_F ),        (TLS)

is called the total least squares (TLS) problem. The squared TLS misfit

    M²_tls(D, B) := min_{D̂ ∈ B} ‖D − D̂‖²_F

is equal to the sum of the squared orthogonal distances from the outcomes d_1, …, d_N to the subspace B. For this reason, the TLS problem is also known as orthogonal regression.

In terms of a kernel representation, the TLS problem is equivalent to

    R̂_tls = arg min_{RR^⊤ = I} ( min_{D̂} ‖D − D̂‖_F subject to RD̂ = 0 ).        (TLS_R)

Note 2.2 (Equation labels) (TLS) is the abstract, representation-free definition of the TLS problem. Equivalent formulations such as (TLS_R) are obtained when a particular representation is chosen. We label frequently used equations with acronyms. Approximation problems, derived from an abstract one, are labeled with the acronym of the abstract problem with the standard variable used for the parameter in a subscript.

The variations of the TLS problem, called generalized total least squares (GTLS) and weighted total least squares (WTLS), are misfit minimization problems (APR) for the model class L^d_{m,0} and weighted norms ‖·‖_i: in the GTLS case, ‖d‖_i := ‖√W d‖, and in the WTLS case, ‖d‖_i := ‖√W_i d‖, for certain positive definite weight matrices W and W_i. Clearly, the TLS problem is a special case of the GTLS problem and the GTLS problem is a special case of the WTLS problem.

The motivation for the weighted norms in the GTLS and WTLS problems comes from statistics. Assume that the data D is generated according to the EIV model:

    D = D̄ + D̃, where D̄ ∈ B̄ ∈ L^d_{m,0}.        (EIV)

The model B̄ is called the true model and D̃ =: [d̃_1 ⋯ d̃_N] is called the measurement error. The measurement error is modeled statistically as a zero mean random matrix. Assuming in addition that the noise d̃_i on the ith outcome is independent of the noise on the other outcomes and is normally distributed with covariance cov(d̃_i) = σ²W_i^{−1}, the maximum likelihood estimation principle leads to the WTLS problem. Therefore, the weight matrices W_i in the WTLS problem formulation correspond (up to the scaling factor σ²) to the inverses of the measurement error covariance matrices in the EIV setup.

Note 2.3 (About the notation) We follow the system theoretic notation and terminology that are adopted in the behavioral setting [PW98]. Translation of the ideas and the formulas to other equivalent forms is straightforward. For example, the system of linear equations AX = B, which is often the starting point for parameter estimation problems in the numerical linear algebra literature, can be viewed as a special kernel representation

    AX = B  ⟺  [X^⊤ −I] [A B]^⊤ = 0  =:  RD = 0.

Therefore, the model represented by the equation AX = B is B(X) := ker([X^⊤ −I]), so that B(X) ∈ L^d_{m,0}, with d = col dim(A) + col dim(B) and m = col dim(A). The representation B(X) is what is called an input/output representation of a linear static model.

In terms of the representation AX = B, the TLS problem with a data matrix D = [A B]^⊤ is the following parameter optimization problem:

    X̂_tls = arg min_X ( min_{Â, B̂} ‖[A − Â  B − B̂]‖_F subject to ÂX = B̂ ).        (TLS_X)

It is not equivalent to (TLS), but generically B(X̂_tls) = ker(R̂_tls), where R̂_tls is the solution of (TLS_R). The nongeneric cases when X̂_tls does not exist occur as a consequence of the fixed input/output partitioning of the variables used in the representation B(X).

Note 2.4 (Quadratic cost function) Whenever ‖·‖_i are weighted 2-norms, the squared misfit M² is a quadratic function of the decision variable D̂. Squaring the cost function results in an equivalent optimization problem (the optimum point is not changed), so that the misfit minimization problem can equivalently be solved by minimizing the squared misfit.

The equation error minimization problem (APReqn) for the linear static model class L^d_{m,0} with a kernel representation B = ker(R) and 2-norms ‖·‖_i is the quadratically constrained least squares problem

    R̂_ls = arg min_{RR^⊤ = I} ‖RD‖_F,        (LS_R)

which happens to be equivalent to the TLS problem.

The classical least squares problem

    X̂_ls = arg min_X ( min_{B̂} ‖B − B̂‖_F subject to AX = B̂ )        (LS_X)

is an equation error minimization problem (APReqn) for the representation AX = B and for 2-norms ‖·‖_i. In general, B(X̂_ls) ≠ ker(R̂_ls), where R̂_ls is the solution of (LS_R). It is well known that the solution of (LS_X) can be computed in a finite number of operations by solving the system of normal equations. In contrast, the solution of (LS_R) is given in terms of the eigenvalue decomposition of DD^⊤ (or the singular value decomposition of D), of which the computation theoretically requires an infinite number of operations.
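As a small illustration of the difference between (LS_X) and (TLS), the following MATLAB-style sketch (hypothetical data, single input, single output) computes both estimates; the SVD-based TLS computation anticipates the closed form solution given in Section 3.3.

    % Minimal sketch (hypothetical data): classical least squares versus total
    % least squares for the input/output representation A*X = B, d = 2, m = p = 1.
    A = [1; 2; 3; 4];  B = [1.1; 1.9; 3.2; 3.9];   % hypothetical noisy data, N = 4
    X_ls = (A' * A) \ (A' * B);          % LS solution via the normal equations
    D = [A B]';                          % data matrix with the outcomes as columns
    [U, S, V] = svd(D);                  % TLS solution via the SVD of D
    R_tls = U(:, 2)';                    % ker(R_tls) is the optimal model B_tls
    X_tls = -R_tls(1) / R_tls(2);        % input/output parameter, when R_tls(2) ~= 0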
2.5 Nonlinear Static Models and Ellipsoid Fitting

An outcome d ∈ U = R^d, consistent with a linear static model, satisfies linear relations Rd = 0. An outcome d ∈ U = R^d, consistent with a nonlinear static model, satisfies nonlinear relations f(d) = 0, where f : R^d → R^g. We consider nonlinear models with representations that are defined by a single bilinear or quadratic function.

The function f : R^d → R is bilinear if f(d) = d_1^⊤ X d_2 − d_3, for all d ∈ R^d and for a certain X ∈ R^{d_1×d_2}, where d =: col(d_1, d_2, d_3) (with d_1 ∈ R^{d_1}, d_2 ∈ R^{d_2}, and d_3 ∈ R). For given d_1 and d_2, such that d = d_1 + d_2 + 1, a bilinear model with a parameter X ∈ R^{d_1×d_2} is defined as follows:

    B_bln(X) := { col(d_1, d_2, d_3) ∈ R^d | d_1^⊤ X d_2 = d_3 };        (BLN)

i.e., a bilinear model is a nonlinear model that allows the representation f(d) = 0, with f a bilinear function. Let M_bln be the set of all bilinear models of the form (BLN),

    M_bln := { B_bln(X) | X ∈ R^{d_1×d_2} }.

In terms of the parameterization (BLN), the misfit minimization problem (APR) for the bilinear model class M_bln with 2-norms ‖·‖_i is

    min_X ( min_{D̂} ‖D − D̂‖_F subject to d̂_{i,1}^⊤ X d̂_{i,2} = d̂_{i,3}, for i = 1, …, N ),        (BLNTLS)

where the inner minimum is M(D, B_bln(X)).

The function f : R^d → R is quadratic if f(d) = d^⊤Ad + d^⊤b + c for all d ∈ R^d and for certain A ∈ R^{d×d}, b ∈ R^d, and c ∈ R. A quadratic model with parameters A, b, c is defined as follows:

    B_qd(A, b, c) := { d ∈ R^d | d^⊤Ad + d^⊤b + c = 0 }.        (QD)

The sets of outcomes consistent with quadratic models are ellipsoids, paraboloids, hyperboloids, etc., in R^d. Let M_qd be the set of all quadratic models,

    M_qd := { B_qd(A, b, c) | [A b; b^⊤ c] is a symmetric (d + 1) × (d + 1) matrix }.

In terms of the parameterization (QD), the misfit minimization problem (APR) for the model class M_qd and 2-norms ‖·‖_i is

    min_{A,b,c, A≠0} ( min_{D̂} ‖D − D̂‖_F subject to [d̂_i^⊤ 1] [A b/2; b^⊤/2 c] [d̂_i; 1] = 0, for i = 1, …, N ),        (QDTLS)

where the inner minimum is M(D, B_qd(A, b, c)).

Problems (BLNTLS) and (QDTLS) have the same geometric interpretation as the TLS problem: minimize the sum of squared orthogonal distances from the data points to the estimated model. In the special case when A > 0 and 4c < b^⊤A^{−1}b, B_qd(A, b, c) is an ellipsoid and the approximation problem becomes an ellipsoid fitting problem. Because of the geometrically appealing cost function, the misfit minimization problem for ellipsoid fitting attracted much attention in the literature. Nevertheless, in the nonlinear case, we do not solve the misfit minimization problems (BLNTLS) and (QDTLS) but alternative modeling problems, called adjusted least squares (ALS). The reasons are

1. the minimization problems (BLNTLS) and (QDTLS) are expensive to solve, and
2. the solutions of these problems do not define consistent estimators.

In the EIV setting, i.e., assuming that the outcomes come from a true model with stochastic measurement error, the aim is to find consistent estimators. An estimator is consistent when it converges asymptotically to the true model as the number N of observed outcomes increases. The estimators defined by the orthogonal regression problems (BLNTLS) and (QDTLS) are not consistent, but the estimator defined by the ALS method is consistent. In addition, the computation of the ALS estimator reduces to a generalized eigenvalue computation and does not require expensive optimization methods.
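The ALS estimator itself (the noise-variance-corrected, generalized eigenvalue computation mentioned above) is developed later in the book and is not reproduced here. As a point of reference only, the following MATLAB-style sketch (hypothetical data) computes the plain, uncorrected algebraic fit of a conic to 2-D points, i.e., it minimizes an equation error for the representation (QD) rather than the misfit (QDTLS); this is the kind of ordinary least squares estimate whose bias the ALS correction removes.

    % Minimal sketch (hypothetical data): uncorrected algebraic (equation error)
    % fit of a quadratic model (QD) to 2-D outcomes d_i = (x_i, y_i).  The
    % parameter theta = [a11; a12; a22; b1; b2; c] of f(d) = d'*A*d + d'*b + c,
    % with A = [a11 a12; a12 a22], is found (up to scaling) as the right singular
    % vector of the extended data matrix for its smallest singular value.
    t = linspace(0, 2*pi, 20);
    X = 3*cos(t) + 0.05*randn(size(t));     % hypothetical noisy points on an ellipse
    Y = 1*sin(t) + 0.05*randn(size(t));
    Dext = [X(:).^2, 2*X(:).*Y(:), Y(:).^2, X(:), Y(:), ones(numel(t), 1)];
    [~, ~, V] = svd(Dext, 0);
    theta = V(:, end);                      % minimizes ||Dext*theta|| over ||theta|| = 1
    A = [theta(1) theta(2); theta(2) theta(3)];  b = theta(4:5);  c = theta(6);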
2.6 Dynamic Models and Global Total Least Squares

In dynamic problems, the data consists of one or more time series w_d = ( w_d(1), …, w_d(T) ).

Note 2.5 (Notation w_d) The letter d in subscript stands for "data". It is used to distinguish a general time series w from a particular given one w_d.

In the context of dynamic problems, we associate U with the set of sequences (R^w)^T. The dynamic nature of a model B is expressed in the existence of relations among the values of a time series w ∈ B at consecutive moments of time. Restricting ourselves to linear constant coefficient relations, this yields the following difference equation:

    R_0 w(t) + R_1 w(t + 1) + ⋯ + R_l w(t + l) = 0, for t = 1, …, T − l.        (DE)

For l = 0 (no time shifts in the linear relations), (DE) describes a linear static model. As in the static case, (DE) is called a kernel representation of the system.¹ The system induced by (DE) is denoted as follows:

    B = ker( R(σ) ) := { w ∈ (R^w)^T | (DE) holds }, where R(z) := Σ_{i=0}^{l} R_i z^i,        (KR)

and σ is the shift operator: (σw)(t) = w(t + 1).

Let B = ker( R(σ) ) with a row proper polynomial matrix R(z) ∈ R^{p×w}[z] and define l := deg(R), m := w − p. It can be shown that for T sufficiently large, dim(B) ≤ Tm + lp. Thus the complexity of the system, which is related to dim(B), is specified by the maximum lag l and the integer m. Under the above assumption, m is equal to the input cardinality of the system, i.e., the number of inputs in an input/output representation of the system. We denote by L^w_{m,l} the class of all linear time-invariant (LTI) systems with w variables, maximum input cardinality m, and maximum lag l. Note that the class of systems L^w_{m,0}, described by zero lag difference equations, is the set of linear static systems of dimension at most m as defined before.

Modeling a dynamic system from data is called system identification. We consider the identification problem for the LTI model class M = L^w_{m,l} and treat first the exact identification problem: given data w_d, such that w_d ∈ B̄ ∈ L^w_{m,l}, find a representation of B̄. Under certain identifiability conditions on the data and the model class, the MPUM B_mpum of w_d in the model class L^w_{m,l} exists and is equal to B̄.

We consider algorithms for passing from w_d to a kernel or input/state/output representation of B_mpum. The algorithms are in the setting of what are called subspace identification methods; i.e., the parameters of the system are retrieved from certain subspaces computed from the given data. We do not emphasize the geometric interpretation and derivation of the subspace algorithms and give instead more system theory oriented derivations.

In their pure form, exact identification algorithms are mainly of theoretical interest. Most system identification problems start from rough data, so that the approximation element is critical. The exact identification algorithms can be modified so that they can work with rough data. We do not pursue this approach but consider instead the misfit approximation problem (APR), which is optimization based.

¹ We do not distinguish between "model" and "system" but preferably use "model" in the static context or in general discussions and "system" in the dynamic context.

The misfit minimization problem (APR) with model class M = L^w_{m,l} and 2-norm ‖·‖_U is called the global total least squares problem (GlTLS). In terms of the kernel representation (KR), the GlTLS problem is

    min_{R(z)} ( min_{w} ‖w_d − w‖ s.t. w ∈ B := ker( R(σ) ) ) s.t. R(z) full row rank, deg(R) = l,        (TLS_{R(z)})

where the inner minimum is M( w_d, ker(R(σ)) ). The constraint "R(z) full row rank, deg(R) = l" is equivalent to B := ker( R(σ) ) ∈ L^w_{m,l}, and the constraint w ∈ B is equivalent to (DE). In turn, (DE) can be written as the structured system of equations

    [R_0 R_1 ⋯ R_l] [ w(1)    w(2)    ⋯  w(T − l)
                      w(2)    w(3)    ⋯  w(T − l + 1)
                      ⋮       ⋮           ⋮
                      w(l+1)  w(l+2)  ⋯  w(T)       ] = 0,

which makes a link with the structured total least squares problem.
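The following MATLAB-style sketch (hypothetical system and data) illustrates the structured system of equations above: a scalar time series generated by a second-order difference equation is arranged in a Hankel matrix H with l + 1 block rows, and the coefficient row [R_0 R_1 R_2] annihilates H.

    % Minimal sketch (hypothetical system): the difference equation (DE) as a
    % structured system of equations.  The time series satisfies
    % w(t+2) = 1.5*w(t+1) - 0.7*w(t), so R*H is zero up to round-off.
    T = 20;  l = 2;
    w = zeros(1, T);  w(1) = 1;  w(2) = 0.5;
    for t = 1:T-2, w(t+2) = 1.5*w(t+1) - 0.7*w(t); end
    R = [0.7 -1.5 1];                      % [R_0 R_1 R_2] for the law above
    H = hankel(w(1:l+1), w(l+1:T));        % rows: w(t), w(t+1), w(t+2)
    residual = norm(R * H);                % (numerically) zero, since w is in ker(R(sigma))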
2.7 Structured Total Least Squares

The GlTLS problem (TLS_{R(z)}) is similar to the TLS problem, the main difference being that the generally unstructured matrix D̂ in the TLS problem is replaced by a block-Hankel structured matrix in the GlTLS problem. In this section, we define a general approximation problem with a constraint expressed as rank deficiency of a structured matrix.

Let S : R^{n_p} → R^{m×(n+d)} be an injective function. A matrix C ∈ R^{m×(n+d)} is said to be S-structured if C ∈ image(S). The vector p for which C = S(p) is called the parameter vector of the structured matrix C. Respectively, R^{n_p} is called the parameter space of the structure S.

The structured total least squares (STLS) problem aims to find an optimal structured low-rank approximation S(p̂) of a given structured matrix S(p); i.e., given a structure specification S, a parameter vector p, and a desired rank n, find

    p̂_stls = arg min_{p̂} ‖p − p̂‖ subject to rank( S(p̂) ) ≤ n.        (STLS)

By representing the rank constraint in (STLS) as "there is a full row rank matrix R ∈ R^{d×(n+d)}, such that RS^⊤(p̂) = 0", the STLS problem can be written equivalently as

    R̂_stls = arg min_{RR^⊤ = I_d} ( min_{p̂} ‖p − p̂‖ subject to RS^⊤(p̂) = 0 ),        (STLS_R)

which is a double minimization problem, similar to the general misfit minimization problem (APR). The STLS formulation, however, is not linked with a particular model class: it is viewed as a flexible tool that can match different misfit minimization problems.
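As a small numerical illustration of the rank constraint reformulation used in (STLS_R), the annihilating matrix R can be read off from the SVD of a rank-deficient matrix. The sketch below uses a hypothetical, unstructured matrix in the role of S(p̂).

    % Minimal sketch (hypothetical matrix): the constraint rank(C) <= n is
    % equivalent to the existence of a full row rank R in R^{d x (n+d)} with R*C' = 0.
    n = 2;  d = 1;  m = 5;                  % desired rank n, rank reduction by d
    C = randn(m, n) * randn(n, n + d);      % a hypothetical m x (n+d) matrix of rank n
    [~, ~, V] = svd(C);
    R = V(:, n+1:end)';                     % d x (n+d), full row rank
    annihilation = norm(R * C');            % zero up to round-off, confirming rank(C) <= n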
Table 2.1 gives a summary of the misfit minimization problems described up to now.
Table 2.1. Misfit minimization problems.

    Name        U          M            Problem
    TLS         R^d        L^d_{m,0}    min_{B̂ ∈ L^d_{m,0}} ‖D − D̂‖_F
    GTLS        R^d        L^d_{m,0}    min_{B̂ ∈ L^d_{m,0}} √( Σ_i ‖√W (d_i − d̂_i)‖² )
    WTLS        R^d        L^d_{m,0}    min_{B̂ ∈ L^d_{m,0}} √( Σ_i ‖√W_i (d_i − d̂_i)‖² )
    Bilinear    R^d        M_bln        min_{B̂ ∈ M_bln} ‖D − D̂‖_F
    Quadratic   R^d        M_qd         min_{B̂ ∈ M_qd} ‖D − D̂‖_F
    GlTLS       (R^w)^T    L^w_{m,l}    min_{B̂ ∈ L^w_{m,l}} ‖w_d − ŵ‖_2
2.8 Algorithms

Optimization Methods

The approximate modeling problem (APR) is a double minimization problem: on the inner level is the misfit computation and on the outer level is the search for the optimal model. In the linear case, the model B is a subspace of U, so that the misfit computation is equivalent to projection of the data D on B. In this case, it is possible to express the misfit M(D, B) in a closed form. The outer minimization problem min_{B ∈ M} M(D, B), however, is a nonlinear least squares problem. We employ local optimization methods for its numerical solution. The local optimization methods require an initial approximation and find only one locally optimal model.

Important issues we deal with are finding good and computationally inexpensive initial approximations and making the evaluation of the misfit function and its first derivative numerically efficient. By solving these issues, we obtain an "engineering" solution of the problem, i.e., a solution that is effective for real-life applications.

Caveat: We aim at efficient evaluation of the misfit function, which ensures efficiency only with respect to the amount of given data: in the static case, the number N of observed outcomes, and in the dynamic case, the length T of the observed time series. In this book, we do not address the related question of achieving efficiency on the level of the outer minimization problem, i.e., with respect to the number of model parameters. Thus an implicit assumption throughout this work is that we aim for a simple approximate model. Dealing with large scale nonlinear least squares problems, however, is a well developed topic (see, e.g., [BHN99]), so that general purpose solutions can be used.

Adjusted Least Squares Method

In the nonlinear case not only the outer minimization but also the misfit computation is a nonconvex problem and requires iterative solution methods. This makes the misfit minimization problem numerically rather expensive. In addition, from a statistical point of view the obtained solution is not attractive, because in the EIV setting, it is inconsistent. For these reasons, we adopt an alternative approach.

The ALS method is a quadratically constrained least squares method. Its solution is obtained from a generalized eigenvalue decomposition. The problem is motivated and derived from the consideration of obtaining a consistent estimator in the EIV setting. Knowing the noise variance, the bias of the ordinary least squares method is removed. This involves adding a correction to the sample covariance matrix. If the noise variance is unknown, it can be estimated together with the model parameters.

Benchmark examples show that the ALS estimator gives good fits that are comparable with those obtained from the orthogonal regression methods. The advantage over the misfit approximation problem, however, is that the ALS approximation does not depend on a user-supplied initial approximation and is computationally less expensive.

Table 2.2 gives a summary of the problems, algorithms, and applications considered in the book.

Table 2.2. Problems, algorithms, and application fields.

    Problem                          Algorithm       Application field
    WTLS                             optimization    chemometrics
    STLS                             optimization    system identification
    bilinear model approximation     ALS             motion analysis
    quadratic model approximation    ALS             ellipsoid estimation

Software Implementation

The algorithms in the book have documented software implementation. Each algorithm is realized by one or more functions and the functions are separated in the following packages:

    MATLAB software for weighted total least squares approximation,
    C software for structured total least squares approximation,
    MATLAB software for balanced model identification, and
    MATLAB software for approximate system identification.

Many of the simulation examples presented in the book are included in the packages as demo files. Thus the reader can try out these examples and modify the simulation settings. Appendix B gives the necessary background information for starting to use the software.
Part I

Static Problems

Chapter 3

Weighted Total Least Squares
We start this chapter with the simplest of the approximate modeling problems: the ones for the linear static model. The kernel, image, and input/output representations of a linear static model are reviewed in Section 3.2. The misfit criterion is defined as a weighted projection and the corresponding misfit approximation problem is called weighted total least squares (WTLS). Two interpretations of the weight matrices in the WTLS formulation are described in Section 3.1. The TLS and GTLS problems are special cases of the WTLS problem and are considered in Section 3.3.

In Section 3.4, we start to discuss the solution of the general WTLS problem. First, the misfit computation is explained. It is a quadratic minimization problem, so that its solution reduces to solving a linear system of equations. The remaining part of the WTLS problem is the minimization with respect to the model parameters. This problem is treated in Section 3.5, where three different algorithms are presented. In Section 3.6, we show simulation results that compare the performance of the algorithms.

3.1 Introduction

In this chapter, we consider approximate modeling by a linear static model. Therefore, the universum of possible outcomes is U = R^d, the available data is the set of N outcomes D = {d_1, …, d_N} ⊂ U, and the model class is M = L^d_{m,0}. The parameter m specifies the maximum allowed complexity for the approximating model.

The matrix D := [d_1 ⋯ d_N] is called the data matrix. We use the shorthand notation

    [d_1 ⋯ d_N] ∈ B ⊆ U  :⟺  d_i ∈ B, for i = 1, …, N.

The WTLS misfit between the data D and a model B ∈ L^d_{m,0} is defined as follows:

    M_wtls( [d_1 ⋯ d_N], B ) := min_{d̂_1, …, d̂_N ∈ B} √( Σ_{i=1}^{N} (d_i − d̂_i)^⊤ W_i (d_i − d̂_i) ),        (Mwtls)

where W_1, …, W_N are given positive definite matrices.
Problem 3.1 (WTLS). Given the data matrix D = [d_1 ⋯ d_N] ∈ R^{d×N}, a complexity bound m, and positive definite weight matrices W_1, …, W_N, find an approximate model

    B̂_wtls := arg min_{B̂ ∈ L^d_{m,0}} M_wtls(D, B̂).        (WTLS)
Note 3.2 (Element-wise weighted total least squares) The special case when all weight matrices are diagonal is called element-wise weighted total least squares (EWTLS). Let W_i = diag(w_{1,i}, …, w_{d,i}) and define the d × N matrix Σ by Σ_{ji} := √(w_{j,i}) for all j, i. Denote by ⊙ the element-wise product, A ⊙ B = [a_{ij} b_{ij}]. Then

    Σ_{i=1}^{N} Δd_i^⊤ W_i Δd_i = ‖Σ ⊙ ΔD‖²_F,

where ΔD := [Δd_1 ⋯ Δd_N] is the correction matrix D − D̂. In Note 3.7, we comment on a statistical interpretation of the EWTLS problem and in Note 3.17 on a relation with a GTLS problem.
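The identity in Note 3.2 is easy to check numerically. The following MATLAB-style sketch (hypothetical correction matrix and weights) evaluates both sides; it is an illustration only.

    % Minimal sketch (hypothetical data): for diagonal weight matrices the WTLS
    % misfit is the Frobenius norm of the element-wise weighted correction.
    d = 3;  N = 5;
    dD = randn(d, N);                       % hypothetical correction matrix D - Dhat
    w  = rand(d, N) + 0.1;                  % w(j,i) is the jth diagonal entry of W_i
    lhs = 0;
    for i = 1:N
        lhs = lhs + dD(:, i)' * diag(w(:, i)) * dD(:, i);   % sum_i dd_i' W_i dd_i
    end
    Sigma = sqrt(w);                        % Sigma(j,i) = sqrt(w(j,i))
    rhs = norm(Sigma .* dD, 'fro')^2;       % || Sigma .* dD ||_F^2  (agrees with lhs)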
Note 3.3 (TLS as an unweighted WTLS) The extreme special case when W_i = I for all i is called unweighted. Then the WTLS problem reduces to the TLS problem. The TLS misfit M_tls weights equally all elements of the correction matrix ΔD. It is a natural choice when there is no prior knowledge about the data. In addition, the unweighted case is computationally easier to solve than the general weighted case.

In the unweighted case, D̂ tends to approximate the large elements of D better than the small ones. This effect can be reduced by introducing proper weights, for example the reciprocals of the entries of the data matrix. The resulting relative error TLS problem is a special case of the WTLS problem with W_i := diag(1/d²_{1i}, …, 1/d²_{di}) or, equivalently, an EWTLS problem with Σ_{ji} = 1/d_{ji}.
Problem 3.4 (Relative error TLS). Given the data matrix D ∈ R^{d×N} and a complexity bound m, find an approximate model

    B̂_rtls := arg min_{B ∈ L^d_{m,0}} min_{D̂ ∈ B} √( Σ_{i=1}^{N} Σ_{j=1}^{d} (D_{ji} − D̂_{ji})² / D²_{ji} ).        (RTLS)

The misfit function of the relative error TLS problem is

    M_rtls(D, B) = min_{D̂ ∈ B} ‖Σ ⊙ (D − D̂)‖_F, where Σ := [1/d_{ji}].
Example 3.5 (Relative errors) Consider the data matrix

    D = [ 5.4710  0.2028  0.5796  0.6665  0.6768
          0.9425  0.7701  0.7374  0.8663  0.9909 ],

obtained from an experiment with d = 2 variables and N = 5 observed outcomes. We aim to model this data, using the model class L²_{1,0}. Note that the elements of D, except for D_11, are approximately five times smaller than D_11. The matrices of the element-wise relative errors

    ΔD_rel := [ |d_{ji} − d̂_{ji}| / |d_{ji}| ]

for the TLS and relative error TLS solutions are, respectively,

    ΔD_rel,tls = [ 0.0153  0.8072  0.2342  0.2404  0.2777
                   0.3711  0.8859  0.7673  0.7711  0.7907 ]

and

    ΔD_rel,rtls = [ 0.8711  0.2064  0.0781  0.0661  0.0030
                    0.1019  0.5322  0.0674  0.0584  0.0030 ].

Note that ΔD_rel,tls,11 = 0.0153 is small but the other elements of ΔD_rel,tls are larger. This is a numerical illustration of the above-mentioned undesirable effect of using the TLS method for approximation of data with elements of very different magnitude. The corresponding total relative errors ‖ΔD_rel‖_F, i.e., the misfits M_rtls(D, B̂), are ‖ΔD_rel,tls‖_F = 1.89 and ‖ΔD_rel,rtls‖_F = 1.06. The example illustrates the advantage of introducing element-wise scaling in the approximation criterion, in order to achieve adequate approximation.
Another situation in which weights are needed is when the data matrix is a noisy measurement of a true matrix that satisfies a true model B̄ ∈ L^d_{m,0} in the model class. Intuition suggests that the elements perturbed by noise with larger variance should be weighted less in the cost function. The precise formulation of this intuitive reasoning leads to the maximum likelihood criterion for the EIV model. The EIV model for the WTLS problem is defined as follows.

Definition 3.6 (WTLS EIV model). The data is a noisy measurement D = D̄ + D̃ of true data D̄ ∈ B̄ ∈ L^d_{m,0}, where B̄ is a true model in the model class, and D̃ is the measurement error. In addition, the vector of measurement errors vec(D̃) is zero mean and Gaussian, with covariance matrix σ² diag(V_1, …, V_N), i.e., vec(D̃) ∼ N( 0, σ² diag(V_1, …, V_N) ). In the estimation problem, the covariance matrices V_1, …, V_N are assumed known but σ² need not be known.
Note 3.7 (Statistical interpretation of the EWTLS problem) From a statistical point of view, the EWTLS problem formulation corresponds to a WTLS EIV setup in which all measurement errors are uncorrelated. We refer to this EIV model as the EWTLS EIV model.

By choosing W_i = V_i^{−1}, the approximation B̂_wtls is the maximum likelihood estimate of B̄ in the WTLS EIV model. Under additional assumptions (see [KV04]) it is a consistent estimator of B̄.
Figure 3.1. Relative error of estimation e as a function of N for four estimators (wls, tls, wtls, gtls).
Note 3.8 (Noise variance estimation) The optimal solution B̂_wtls does not depend on a scaling of the weight matrices by the same scaling factor; i.e., the optimal solution with weights σ²W_i does not depend on σ². It turns out that the normalized optimal misfit M(D, B̂_wtls)/N is an estimate of the noise variance σ² in the EIV model.
Example 3.9 (Consistency) We set up a simulation example corresponding to the WTLS EIV model with d = 3, m = 2, and N ranging from 75 to 750. Let U(u, ū) be a matrix of independent and uniformly distributed elements in the interval [u, ū]. The true data matrix is D̄ = U(0, 1) and the measurement error covariance matrices are σ²V_i = diag(σ²_{i1}, σ²_{i2}, σ²_{i3}), where σ_{i1} = σ_{i2} = U(0.01, 0.26), and σ_{i3} = U(0.01, 0.035).

For a fixed N ∈ [75, 750], 500 noise realizations are generated and the estimates are computed with the following methods:

TLS — total least squares (W_i = W = I),
GTLS — generalized total least squares (W_i = W = V_avr^{−1}, where V_avr := ( Σ_{i=1}^{N} √V_i / N )²),
WTLS — weighted total least squares (W_i = V_i^{−1}), and
WLS — weighted least squares (that minimizes a weighted norm of the equation error of an input/output representation; see Section 3.2).

A relative error of estimation e that measures the distance from the estimated model B̂ to the true one B̄ (in terms of the parameter X in an input/output representation; see Section 3.2) is averaged over the 500 noise realizations and plotted as a function of N in Figure 3.1. Convergence of the relative error of estimation to zero as N increases indicates consistency of the corresponding estimator.
The stochastic framework gives a convincing interpretation of the weight matrices W_i. Also, it suggests possible ways to choose them. For example, they can be selected by noise variance estimation from repeated measurements or from prior knowledge about the accuracy of the measurement devices.

A practical application of the WTLS problem occurs in chemometrics, where the aim is to estimate the concentrations of certain chemical substances in a mixture from spectral measurements on the mixture. For details see [WAH+97, SMWV05].
3.2 Kernel, Image, and Input/Output Representations

In this section we review three common representations of a linear static model: kernel, image, and input/output. They give different parameterizations of the model and are important in setting up algorithms for approximate modeling with the model class L^d_{m,0}.

Kernel Representation

Let B ∈ L^d_{m,0}; i.e., B is a subspace of R^d with dimension at most m. A kernel representation of B is given by a system of equations Rd = 0, such that B = { d ∈ R^d | Rd = 0 } or, equivalently, by B = ker(R). The matrix R ∈ R^{g×d} is a parameter of the model B.

The parameter R is not unique. There are two sources for the nonuniqueness:

1. R might have redundant rows, and
2. for a full-rank matrix U, ker(R) = ker(UR).

The parameter R having redundant rows is related to the minimality of the representation. For a given linear static model B, the representation Rd = 0 of B is minimal if R has the minimal number of rows among all parameters R that define a kernel representation of B. The kernel representation, defined by R, is minimal if and only if R is full row rank.

Because of item 2, a minimal kernel representation is still not unique. All minimal representations, however, are related to a given one via a premultiplication of the parameter R with a nonsingular matrix U. In a minimal kernel representation, the rows of R are a basis for B^⊥, the orthogonal complement of B, i.e., B^⊥ = row span(R). The choice of R is nonunique due to the nonuniqueness in the choice of basis of B^⊥.

Assuming that B ∈ L^d_{m,0} and dim(B) = m, the minimal number of laws necessary to define B is p := d − m; i.e., in a minimal representation, B = ker(R) with row dim(R) = p.

Image Representation

The dual of the kernel representation B = ker(R) is the image representation

    B = { d ∈ R^d | d = Pl, l ∈ R^l }

or, equivalently, B = col span(P). Again, for a given B ∈ L^d_{m,0}, an image representation B = col span(P) is not unique because of the possible nonminimality of P and the choice of basis. The representation is minimal if and only if P is a full column rank matrix. In a minimal image representation, col dim(P) = dim(B) and the columns of P form a basis for B. Clearly, col span(P) = col span(PU), for any nonsingular matrix U ∈ R^{l×l}.

Note that

    ker(R) = col span(P) = B ∈ L^d_{m,0}  ⟹  RP = 0,

which gives a link between the parameters P and R.

Input/Output Representation

Both the kernel and the image representations treat all variables on an equal footing. In contrast, the more classical input/output representation

    B_i/o(X) := { d =: col(d_i, d_o) ∈ R^d | X^⊤ d_i = d_o }        (I/Orepr)

distinguishes free variables d_i ∈ R^m, called inputs, and dependent variables d_o ∈ R^p, called outputs. In an input/output representation, d_i can be chosen freely, while d_o is fixed by d_i and the model.

The partitioning d = col(d_i, d_o) gives an input/output partitioning of the variables: the first m := dim(d_i) variables are inputs and the remaining p := dim(d_o) = d − m variables are outputs. An input/output partitioning is not unique. Given a kernel or image representation, finding an input/output partitioning is equivalent to selecting a p × p full-rank submatrix of R or an m × m full-rank submatrix of P. In fact, generically, any splitting of the variables into a group of p variables (outputs) and a group of remaining variables (inputs) defines a valid input/output partitioning. In nongeneric cases, certain partitionings of the variables into inputs and outputs are not possible.

Note that in (I/Orepr), the first m variables are fixed to be inputs, so that given X, the input/output representation B_i/o(X) is fixed and vice versa; given B ∈ L^d_{m,0}, the parameter X (if it exists) is unique. Thus, as opposed to the parameters R and P in the kernel and the image representations, the parameter X in the input/output representation (I/Orepr) is unique.

Consider the input/output representation B_i/o(X) of B ∈ L^d_{m,0}. The matrices

    R = [X^⊤ −I] and P = [I; X^⊤]

are parameters of, respectively, kernel and image representations of B, i.e.,

    B_i/o(X) = ker( [X^⊤ −I] ) = col span( [I; X^⊤] ).

Conversely, given the parameters

    R =: [R_i R_o], R_o ∈ R^{p×p}, and P =: [P_i; P_o], P_i ∈ R^{m×m},

of, respectively, kernel and image representations of B ∈ L^d_{m,0}, and assuming that R_o and P_i are nonsingular,

    X^⊤ = −R_o^{−1}R_i = P_o P_i^{−1}

is the parameter of the input/output representation (I/Orepr) of B, i.e.,

    ker( [R_i R_o] ) = col span( [P_i; P_o] ) = B_i/o( (−R_o^{−1}R_i)^⊤ ) = B_i/o( (P_o P_i^{−1})^⊤ ).

Figure 3.2 shows the links among kernel, image, and input/output representations.
Figure 3.2. Links among kernel, image, and input/output representations of B ∈ L^d_{m,0}. (The arrows of the diagram carry the transition formulas RP = 0, X^⊤ = −R_o^{−1}R_i, X^⊤ = P_o P_i^{−1}, R = [X^⊤ −I], and P^⊤ = [I X].)
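The transitions of Figure 3.2 are a one-line computation in each direction. The following MATLAB-style sketch (hypothetical model with d = 3, m = 2) constructs all three parameters for the same model B and verifies the links; it is an illustration only.

    % Minimal sketch (hypothetical model): passing between kernel, image, and
    % input/output parameters of the same B in L^d_{m,0}.
    X  = [2; -1];                           % hypothetical input/output parameter, X' * d_i = d_o
    R  = [X' -1];                           % kernel parameter:  B = ker(R)
    P  = [eye(2); X'];                      % image parameter:   B = col span(P)
    RP_zero = norm(R * P);                  % R*P = 0 links the two parameterizations
    Ri = R(:, 1:2);  Ro = R(:, 3);
    X_from_R = (-(Ro \ Ri))';               % X' = -Ro^{-1} * Ri
    Pi = P(1:2, :);  Po = P(3, :);
    X_from_P = (Po / Pi)';                  % X' = Po * Pi^{-1}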
Note 3.10 (Weighted least squares) The input/output latency minimization problem

    X̂_wls = arg min_X √( Σ_{i=1}^{N} (X^⊤ d_{i,i} − d_{o,i})^⊤ W_i (X^⊤ d_{i,i} − d_{o,i}) ),        (WLS_X)

corresponding to problem (APReqn) with ‖e‖_i = ‖√W_i e‖, is the weighted LS problem.

Note 3.11 (AX = B notation) A standard notation adopted in the numerical linear algebra literature for the input/output linear static model representation (I/Orepr) is a^⊤X = b^⊤, i.e., a = d_i and b = d_o. For repeated observations D = [d_1 ⋯ d_N], the statement D ∈ B_i/o(X) is equivalent to the linear system of equations AX = B, where [A B] := D^⊤ with A ∈ R^{N×m} and B ∈ R^{N×p}.
3.3 Special Cases with Closed Form Solutions

The special cases

    W_i = I, i.e., the total least squares problem, and
    W_i = W, i.e., the generalized total least squares problem,

allow closed form solution in terms of the singular value decomposition (SVD) of the data matrix D = [d_1 ⋯ d_N]. For general weight matrices W_i, however, the WTLS problem has no closed form solution and its solution is based on numerical optimization methods that are less robust and efficient. For this reason, recognizing the special cases and applying the special methods is important.

The following lemmas are instrumental for the solution of the TLS problem.

Lemma 3.12. For D ∈ R^{d×N} and m ∈ N, D ∈ B ∈ L^d_{m,0} ⟺ rank(D) ≤ m.

Proof. Let D ∈ B ∈ L^d_{m,0} and consider a minimal kernel representation of B = ker(R), where R ∈ R^{p×d} is full row rank. Then D ∈ B ∈ L^d_{m,0} ⟹ RD = 0 ⟹ rank(D) ≤ m.

Now let D be rank deficient with rank(D) ≤ m. Then there is a full row rank matrix R ∈ R^{p×d}, p := d − m, that annihilates D, i.e., RD = 0. The matrix R defines a model in the class L^d_{m,0} via B := ker(R). Then RD = 0 ⟹ D ∈ B ∈ L^d_{m,0}.

Lemma 3.12 shows that the approximation of the data matrix D with a model in the class L^d_{m,0} is equivalent to finding a matrix D̂ ∈ R^{d×N} with rank at most m. In the case when the approximation criterion is ‖D − D̂‖_F (TLS problem) or ‖D − D̂‖_2, the problem has a solution in terms of the SVD of D. The result is known as the Eckart-Young-Mirsky low-rank matrix approximation theorem [EY36]. We state it in the next lemma.

Lemma 3.13 (Matrix approximation lemma). Let D = UΣV^⊤ be the SVD of D ∈ R^{d×N} and partition the matrices U, Σ =: diag(σ_1, …, σ_d), and V as follows:

    U =: [U_1 U_2], U_1 ∈ R^{d×m},    Σ =: [Σ_1 0; 0 Σ_2], Σ_1 ∈ R^{m×m},    V =: [V_1 V_2], V_1 ∈ R^{N×m},        (SVDPRT)

where m ∈ N is such that 0 ≤ m ≤ min(d, N) and p := d − m. Then the rank-m matrix D̂* = U_1Σ_1V_1^⊤ is such that

    ‖D − D̂*‖_F = min_{rank(D̂) ≤ m} ‖D − D̂‖_F = √( σ²_{m+1} + ⋯ + σ²_d ).

The solution D̂* is unique if and only if σ_{m+1} ≠ σ_m.

The solution of the TLS problem (TLS) trivially follows from Lemmas 3.12 and 3.13.

Theorem 3.14 (Solution of the TLS problem). Let D = UΣV^⊤ be the SVD of D and partition the matrices U, Σ, and V as in (SVDPRT). Then a TLS approximation of D in L^d_{m,0} is

    D̂_tls = U_1Σ_1V_1^⊤,    B̂_tls = ker(U_2^⊤) = col span(U_1),

and the corresponding TLS misfit is

    ‖D − D̂_tls‖_F = √( σ²_{m+1} + ⋯ + σ²_d ), where Σ_2 =: diag(σ_{m+1}, …, σ_d).

A TLS approximation always exists. It is unique if and only if σ_m ≠ σ_{m+1}.

Note 3.15 (Efficient computation of B̂_tls) If one is interested in an approximate model B̂_tls and not in the approximated data D̂_tls, the computation can be done more efficiently. A TLS model B̂_tls depends only on the left singular vectors of D. Therefore, for any orthogonal matrix Q, a TLS approximation computed for the data matrix DQ is still B̂_tls (the left singular vectors are not affected by the multiplication with Q). Let D = [R_1 0] Q^⊤ (equivalently DQ = [R_1 0]) be the QR factorization of D. A TLS approximation of R_1 is the same as a TLS approximation of D. For N ≫ d, computing the QR factorization D = [R_1 0] Q^⊤ and the SVD of R_1 is a more efficient alternative for finding B̂_tls than computing the SVD of D.
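The compression idea of Note 3.15 is easy to verify numerically. The following MATLAB-style sketch (hypothetical data) computes the triangular factor via a QR factorization of D^⊤ and takes the SVD of the small d × d factor; it is a sketch of the idea, not the book's implementation.

    % Minimal sketch (hypothetical data): for N >> d, B_tls depends only on the
    % left singular vectors of D, so D can first be compressed by a QR step.
    d = 4;  m = 2;  N = 1000;
    D = randn(d, N);                       % hypothetical data matrix
    [~, R1] = qr(D', 0);                   % economy QR: D' = Q*R1, with R1 of size d x d
    [U, ~, ~] = svd(R1');                  % left singular vectors of R1' equal those of D
    B_tls_basis = U(:, 1:m);               % image parameter of the TLS model col span(U_1)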
Note 3.16 (Nongeneric TLS problems) The TLS problem formulation (TLS_X) suffers from the drawback that the optimal approximating model B̂_tls might have no input/output representation (I/Orepr). In this case (known as a nongeneric TLS problem), the optimization problem (TLS_X) has no solution. By suitable permutation of the variables, however, (TLS_X) can be made solvable, so that X̂_tls exists and B̂_tls = B_i/o(X̂_tls).

    The issue of whether the TLS problem is generic or not is not related to the approximation of the data per se but to the possibility of representing the optimal model B̂_tls in the form (I/Orepr), i.e., to the possibility of imposing a given input/output partition on B̂_tls.

The solution of the GTLS problem can be obtained from the solution of a TLS problem for a modified data matrix. In fact, with the same transformation technique, we can solve a more general WTLS problem than the previously defined GTLS problem. Define the following misfit criterion:

    M_gtls2(D, B) = min_{D̂ ∈ B} ‖ √W_l (D − D̂) √W_r ‖_F.        (Mgtls2)

With W_l = W and W_r = I the misfit minimization problem

    B̂_gtls2 = arg min_{B̂ ∈ L^d_{m,0}} M_gtls2(D, B̂)        (GTLS2)

reduces to the previously defined GTLS problem. The right weight matrix W_r, however, gives additional degrees of freedom for choosing an appropriate weighting pattern.

Note 3.17 (EWTLS with rank-one weight matrix Σ) Let W_l = diag(w_l) and W_r = diag(w_r), where w_l ∈ R^d_+ and w_r ∈ R^N_+ are given vectors with positive elements. Then

    M_gtls2(D, B) = min_{D̂ ∈ B} ‖ Σ_gtls2 ⊙ (D − D̂) ‖_F, where Σ_gtls2 = √w_l (√w_r)^⊤

(the square roots acting element-wise); i.e., the GTLS problem (GTLS2) with diagonal weight matrices W_l and W_r corresponds to an EWTLS problem with rank-one weight matrix Σ.

Theorem 3.18 (Solution of the GTLS problem). Define the modified data matrix

    D_m := √W_l D √W_r,

and let D̂_m,tls, B̂_m,tls = ker(R_m,tls) = col span(P_m,tls) be a TLS approximation of D_m in L^d_{m,0}. Then a solution of the GTLS problem (GTLS2) is

    D̂_gtls2 = (√W_l)^{−1} D̂_m,tls (√W_r)^{−1},
    B̂_gtls2 = ker( R_m,tls √W_l ) = col span( (√W_l)^{−1} P_m,tls ),

and the corresponding GTLS misfit is

    ‖ √W_l (D − D̂_gtls2) √W_r ‖_F = ‖ D_m − D̂_m,tls ‖_F.

A GTLS solution always exists. It is unique if and only if B̂_m,tls is unique.
Proof. The cost function of the GTLS problem (GTLS2) can be written as

    ‖ √W_l (D − D̂) √W_r ‖_F = ‖ D_m − √W_l D̂ √W_r ‖_F =: ‖ D_m − D̂_m ‖_F,

which is the cost function of a TLS problem for the modified data matrix D_m. Because the mapping D̂ ↦ D̂_m, defined by D̂_m = √W_l D̂ √W_r, is one-to-one, the above transformation shows that the GTLS problem for D is equivalent to the TLS problem for D_m. The GTLS solution D̂_gtls2 is recovered from the TLS solution D̂_m,tls by the inverse transformation D̂_gtls2 = (√W_l)^{−1} D̂_m,tls (√W_r)^{−1}. We have

    B̂_gtls2 = col span( D̂_gtls2 ) = col span( (√W_l)^{−1} D̂_m,tls ) = col span( (√W_l)^{−1} P_m,tls ),

and it follows that B̂_gtls2 = ker( R_m,tls √W_l ).
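The transformation of Theorem 3.18 is a short computation. The following MATLAB-style sketch (hypothetical data and weights) solves a GTLS problem by solving a TLS problem for the modified data matrix and mapping the result back; it is a sketch only, not the book's software.

    % Minimal sketch (hypothetical data and weights): GTLS via the transformation
    % of Theorem 3.18, i.e., a TLS problem for the modified data matrix D_m.
    d = 3;  m = 2;  N = 50;
    D  = randn(d, N);                      % hypothetical data
    Wl = diag([1 4 2]);  Wr = eye(N);      % left and right weight matrices
    Dm = sqrtm(Wl) * D * sqrtm(Wr);        % modified data matrix
    [U, ~, ~] = svd(Dm);
    R_m_tls = U(:, m+1:end)';              % TLS model for D_m: ker(R_m_tls)
    R_gtls2 = R_m_tls * sqrtm(Wl);         % kernel parameter of the GTLS model
    P_gtls2 = sqrtm(Wl) \ U(:, 1:m);       % image parameter of the GTLS model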
3.4 Misfit Computation

The WTLS problem is a double minimization problem with an inner minimization, the search for the best approximation of the data in a given model, and an outer minimization, the search for the model. First, we solve the inner minimization problem: the misfit computation (Mwtls).

Since the model is linear, (Mwtls) is a convex quadratic optimization problem with a linear constraint. Therefore, it has an analytic solution. In order to give explicit formulas for the optimal approximation D̂_wtls and M_wtls(D, B), however, we need to choose a particular parameterization of the given model B. Three parameterizations (kernel, image, and input/output) are described in Section 3.2. We state the results for the kernel and the image representations. The results for the input/output representation follow from the given ones by the substitutions R ↦ [X^⊤ −I] and P ↦ [I; X^⊤].

Theorem 3.19 (WTLS misfit computation, kernel representation version). Let ker(R) be a minimal kernel representation of B ∈ L^d_{m,0}. The best WTLS approximation of D in B, i.e., the solution of (Mwtls), is

    d̂_wtls,i = ( I − W_i^{−1}R^⊤(RW_i^{−1}R^⊤)^{−1}R ) d_i, for i = 1, …, N,

with the corresponding misfit

    M_wtls( D, ker(R) ) = √( Σ_{i=1}^{N} d_i^⊤ R^⊤ (RW_i^{−1}R^⊤)^{−1} R d_i ).        (Mwtls_R)

Proof. Define the correction ΔD := D − D̂. The misfit computation problem (Mwtls) is equivalent to

    min_{Δd_1, …, Δd_N} Σ_{i=1}^{N} Δd_i^⊤ W_i Δd_i subject to R(d_i − Δd_i) = 0, for i = 1, …, N.

Observe that this is a separable weighted least norm problem; i.e., it involves N independent weighted least norm subproblems. Define E := RD and let E =: [e_1 ⋯ e_N]. Consider the ith subproblem

    min_{Δd_i} Δd_i^⊤ W_i Δd_i subject to RΔd_i = e_i.

Its solution is

    Δd_i* = W_i^{−1}R^⊤(RW_i^{−1}R^⊤)^{−1}Rd_i,

so that the squared minimum misfit is

    M²_wtls(D, B) = Σ_{i=1}^{N} Δd_i*^⊤ W_i Δd_i* = Σ_{i=1}^{N} d_i^⊤ R^⊤ (RW_i^{−1}R^⊤)^{−1} R d_i.
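Formula (Mwtls_R) can be evaluated directly, one outcome at a time. The following MATLAB-style sketch (hypothetical data, model, and weights) does so; it is only an illustration of the closed form expression.

    % Minimal sketch (hypothetical data): evaluation of the WTLS misfit (Mwtls_R)
    % for a given model ker(R) and given positive definite weight matrices W_i.
    d = 3;  m = 2;  p = d - m;  N = 10;
    D = randn(d, N);                        % hypothetical data
    R = orth(randn(d, p))';                 % hypothetical full row rank kernel parameter
    M2 = 0;
    for i = 1:N
        Wi = diag(0.5 + rand(d, 1));        % hypothetical positive definite weight matrix
        di = D(:, i);
        M2 = M2 + di' * R' * ((R / Wi * R') \ (R * di));   % d_i' R' (R W_i^{-1} R')^{-1} R d_i
    end
    M_wtls = sqrt(M2);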
Next, we state the result in the special case of a GTLS problem.

Corollary 3.20 (GTLS misfit computation, kernel representation version). Let ker(R) be a minimal kernel representation of B ∈ L^d_{m,0}. The best GTLS approximation of D in B is

    D̂_gtls = ( I − W^{−1}R^⊤(RW^{−1}R^⊤)^{−1}R ) D,

with the corresponding misfit

    M_gtls( D, ker(R) ) = √( trace( D^⊤ R^⊤ (RW^{−1}R^⊤)^{−1} R D ) ).        (Mgtls_R)

The image representation is dual to the kernel representation. Correspondingly, the misfit computation with kernel and with image representations of the model are dual problems. The kernel representation leads to a weighted least norm problem and the image representation leads to a WLS problem.

Theorem 3.21 (WTLS misfit computation, image representation version). Let col span(P) be a minimal image representation of B ∈ L^d_{m,0}. The best WTLS approximation of D in B is

    d̂_wtls,i = P(P^⊤W_iP)^{−1}P^⊤W_i d_i, for i = 1, …, N,

with the corresponding misfit

    M_wtls( D, col span(P) ) = √( Σ_{i=1}^{N} d_i^⊤ W_i ( I − P(P^⊤W_iP)^{−1}P^⊤W_i ) d_i ).        (Mwtls_P)

Proof. In terms of the image representation, D̂ = PL, with L =: [l_1 ⋯ l_N], problem (Mwtls) is equivalent to

    min_{l_1, …, l_N} Σ_{i=1}^{N} (d_i − d̂_i)^⊤ W_i (d_i − d̂_i) subject to d̂_i = Pl_i, for i = 1, …, N,

which is a separable WLS problem. The solution of the ith subproblem

    min_{l_i} (d_i − d̂_i)^⊤ W_i (d_i − d̂_i) subject to d̂_i = Pl_i

is l_i* = (P^⊤W_iP)^{−1}P^⊤W_id_i, so that d̂_i* = P(P^⊤W_iP)^{−1}P^⊤W_id_i.

Corollary 3.22 (GTLS misfit computation, image representation version). Let col span(P) be a minimal image representation of B ∈ L^d_{m,0}. The best GTLS approximation of D in B is

    D̂_gtls = P(P^⊤WP)^{−1}P^⊤WD,

with the corresponding minimum value of the misfit function

    M_gtls( D, col span(P) ) = √( trace( D^⊤ W ( I − P(P^⊤WP)^{−1}P^⊤W ) D ) ).        (Mgtls_P)
3.5 Misfit Minimization

In Section 3.4, we solved the inner minimization problem of the WTLS problem, the misfit computation. Now we consider the remaining problem: the minimization with respect to the model parameters. This is a nonconvex optimization problem that in general has no closed form solution. For this reason, numerical optimization methods are employed for its solution. First, we review the methods proposed in the literature. Then we present in detail three algorithms. In Section 3.6, we compare their performance on test examples.

Algorithms Proposed in the Literature

Special optimization methods for the WTLS problem are proposed in [DM93, WAH+97, PR02, MMH03]. The Riemannian singular value decomposition (RiSVD) framework of De Moor [DM93] is derived for the STLS problem and includes the EWTLS problem with complexity specification m = d − 1 as a special case. The restriction to more general WTLS problems comes from the fact that the RiSVD framework is derived for matrix approximation problems with rank reduction by one and with diagonal weight matrices. In [DM93], an algorithm resembling the inverse power iteration algorithm is proposed for computing the RiSVD. The method, however, has no proven convergence properties.

The maximum likelihood principal component analysis (MLPCA) method of Wentzell et al. [WAH+97] is an alternating least squares algorithm. It applies to the general WTLS problems and is globally convergent. The convergence rate, however, is linear and the method can be rather slow in practice.

The method of Premoli and Rastello [PR02] is a heuristic for solving the first order optimality condition of (WTLS). A solution of a nonlinear equation is searched for instead of a minimum point of the original optimization problem. The method is locally convergent with superlinear convergence rate. The method is not globally convergent and the region of convergence around a minimum point can be rather small in practice.

The weighted low-rank approximation (WLRA) framework of Manton, Mahony, and Hua [MMH03] proposes specialized optimization methods on a Grassmann manifold. The least squares nature of the problem is not exploited by the algorithms proposed in [MMH03].

Table 3.1. Model representations and optimization algorithms used in the methods of [DM93, WAH+97, PR02, MMH03].

    Method   Representation   Algorithm
    RiSVD    kernel           inverse power iteration
    MLPCA    image            alternating projections
    PR       input/output     iteration based on heuristic linearization
    WLRA     kernel           Newton method

The RiSVD, MLPCA, Premoli-Rastello (PR), and WLRA methods differ in the parameterization of the model and the optimization algorithm that are used. Table 3.1 summarizes the model parameterizations and optimization algorithms for the different methods.
Alternating Least Squares Algorithm

The alternating least squares method is based on an image representation of the model, i.e., B = col span(P), where P ∈ R^{d×m}. First we rewrite (WTLS) in the form

    min_{D̂ ∈ B̂ ∈ L^d_{m,0}} √( vec^⊤(D − D̂) W vec(D − D̂) ), where W := diag(W_1, …, W_N).

The constraint D̂ ∈ B̂ is equivalent to D̂ = PL, where L ∈ R^{m×N}, so that (WTLS) can further be written as

    min_{P ∈ R^{d×m}} min_{L ∈ R^{m×N}} vec^⊤(D − PL) W vec(D − PL).        (WTLS_P)

The two problems

    min_{P ∈ R^{d×m}} vec^⊤(D − PL) W vec(D − PL),        (RLX1)
    min_{L ∈ R^{m×N}} vec^⊤(D − PL) W vec(D − PL),        (RLX2)

derived from (WTLS_P) by fixing, respectively, L and P to given values, are WLS problems and therefore can be solved in closed form. They can be viewed as relaxations of the nonconvex problem (WTLS_P) to convex problems. Note that (RLX2) is the misfit computation of (WTLS_P) and its solution for the case where W is block-diagonal is (Mwtls_P).

The alternating least squares method is an iterative method that alternately solves (RLX1) and (RLX2) with, respectively, L and P fixed to the solution of the previously solved relaxation problem. The resulting algorithm is Algorithm 3.1.
Algorithm 3.1 Alternating least squares algorithm for the WTLS problem  wtlsap

Input: data matrix D ∈ R^{d×N}, weight matrix W ∈ R^{Nd×Nd}, complexity specification m for the WTLS approximation, and relative convergence tolerance ε.
1: Initial approximation: compute a TLS approximation of D in L^d_{m,0} and let P^(0) := P̂_tls, D^(0) := D̂_tls, L^(0) := L̂_tls, where L̂_tls is the matrix such that D̂_tls = P̂_tls L̂_tls.
2: k := 0.
3: repeat
4:   Compute the solution of (RLX2):
         vec(L^(k+1)) = ( P̄^(k)⊤ W P̄^(k) )^{−1} P̄^(k)⊤ W vec(D), where P̄^(k) := I_N ⊗ P^(k).
5:   M^(k)_wtls = √( vec^⊤(D − P^(k)L^(k+1)) W vec(D − P^(k)L^(k+1)) ).
6:   k = k + 1.
7:   Compute the solution of (RLX1):
         vec(P^(k)) = ( L̄^(k)⊤ W L̄^(k) )^{−1} L̄^(k)⊤ W vec(D), where L̄^(k) := L^(k)⊤ ⊗ I_d.
8:   M^(k)_wtls = √( vec^⊤(D − P^(k)L^(k)) W vec(D − P^(k)L^(k)) ).
9: until |M^(k)_wtls − M^(k−1)_wtls| / M^(k)_wtls < ε.
Output: P̂_wtls := P^(k) and D̂_wtls = P^(k)L^(k).

Algorithm 3.1 is given for an arbitrary positive definite weight matrix W. When W is block-diagonal (WTLS problem) or diagonal (EWTLS problem), Algorithm 3.1 can be implemented more efficiently, taking into account the structure of W. For example, in the WTLS case, the solution of problem (RLX2) can be computed efficiently, column by column, as

    l_i = (P^⊤W_iP)^{−1}P^⊤W_i d_i, for i = 1, …, N,
where l_i is the ith column of L; this is precisely the per-outcome formula (Mwtls_P), which takes the block-diagonal structure of W into account. The structure of W can be exploited in the same way in the solution of problem (RLX1).
The alternating least squares algorithm monotonically decreases the cost function value, so that it is globally convergent. The convergence rate, however, is linear and depends on the distribution of the singular values of D [MMH03, Section IV.A]. With a pair of singular values that are close to each other the convergence rate can be rather low.
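For concreteness, the following MATLAB-style sketch (hypothetical data and weights; a fixed number of sweeps instead of the stopping test of Algorithm 3.1) carries out the two alternating updates, exploiting the block-diagonal structure of W as discussed above. It is a sketch of the iteration only, not the wtlsap function of the software package.

    % Minimal sketch (hypothetical data): alternating least squares sweeps for the
    % WTLS problem with block-diagonal W, using per-outcome formulas.
    d = 4;  m = 2;  N = 30;
    D = randn(d, N);                        % hypothetical data
    W = zeros(d, d, N);                     % hypothetical positive definite weights
    for i = 1:N, Wi = randn(d); W(:, :, i) = Wi' * Wi + eye(d); end
    [U, ~, ~] = svd(D);  P = U(:, 1:m);     % initial approximation (TLS)
    L = zeros(m, N);
    for k = 1:50                            % fixed number of sweeps, for simplicity
        for i = 1:N                         % (RLX2): update L with P fixed, column by column
            L(:, i) = (P' * W(:, :, i) * P) \ (P' * W(:, :, i) * D(:, i));
        end
        G = zeros(d*m);  h = zeros(d*m, 1); % (RLX1): update P with L fixed
        for i = 1:N
            A = kron(L(:, i)', eye(d));     % vec(P * l_i) = A * vec(P)
            G = G + A' * W(:, :, i) * A;
            h = h + A' * W(:, :, i) * D(:, i);
        end
        P = reshape(G \ h, d, m);
    end
    misfit = 0;                             % WTLS misfit of the final approximation
    for i = 1:N
        e = D(:, i) - P * L(:, i);  misfit = misfit + e' * W(:, :, i) * e;
    end
    misfit = sqrt(misfit);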
Note 3.23 (Initial approximation) For simplicity the initial approximation is chosen to be a TLS approximation of D in L^d_{m,0}. The weight matrix W, however, can partially be taken into account in the computation of the initial approximation. Let the vector w ∈ R^{dN}_+ be the diagonal of W and let W̃ ∈ R^{d×N} be defined as W̃ := vec^{−1}(w), where the mapping w ↦ W̃ is denoted by vec^{−1}. The EWTLS problem with weights Σ_{ji} := √(W̃_{ji}) can be viewed as an approximation of the WTLS problem that takes into account only the diagonal elements of W and ignores all off-diagonal elements. In the EIV setting, this is equivalent to ignoring the cross covariance information.

The solution of the EWTLS problem, however, still requires local optimization methods that have to be initialized from a given initial approximation. An additional simplification that results in a GTLS problem, and therefore in an exact solution method, is to approximate the matrix Σ by a rank-one matrix; see Note 3.17.
Algorithm of Premoli and Rastello

The algorithm of Premoli and Rastello uses the input/output representation B_i/o(X) of the model. We have

    D ∈ B ∈ L^d_{m,0}  ⟺  AX = B,   where D⊤ =: [A  B],

where col dim(A) = m, col dim(B) = p, and d = m + p. The WTLS misfit as a function of the parameter X is (see (Mwtls_R))

    M_wtls(X) = √( Σ_{i=1}^N d_i⊤ R⊤ (R W_i^{-1} R⊤)^{-1} R d_i ),   where R := [X⊤  −I].      (Mwtls_X)

Then (WTLS) becomes the unconstrained optimization problem

    X̂_wtls = arg min_X M_wtls(X).      (WTLS_X)

Define V_i := W_i^{-1} and the residual matrix

    E(X) := AX − B,   E⊤(X) =: [e_1(X) ⋯ e_N(X)],

and partition d_i and V_i as follows:

    d_i =: [a_i; b_i],  a_i ∈ R^m, b_i ∈ R^p,    V_i =: [V_{a,i}  V_{ab,i}; V_{ba,i}  V_{b,i}],  V_{a,i} ∈ R^{m×m}, V_{b,i} ∈ R^{p×p}.

The first order optimality condition M'_wtls(X) = 0 of (WTLS_X) is (see Appendix A.1)

    2 Σ_{i=1}^N ( a_i e_i⊤(X) Γ_i^{-1}(X) − (V_{a,i} X − V_{ab,i}) Γ_i^{-1}(X) e_i(X) e_i⊤(X) Γ_i^{-1}(X) ) = 0,

where Γ_i(X) := R W_i^{-1} R⊤. We aim to find a solution of M'_wtls(X) = 0 that corresponds to a solution of the WTLS problem, i.e., to a global minimum point of M_wtls.
The algorithm proposed in [PR02] uses an iterative procedure starting from an initial approximation X^(0) and generating a sequence of approximations X^(k), k = 0, 1, 2, ..., that approaches a solution of M'_wtls(X) = 0. The iteration is implicitly defined by the equation

    F(X^(k+1), X^(k)) = 0,      (LINRLX)

where

    F(X^(k+1), X^(k)) := 2 Σ_{i=1}^N ( a_i (X^(k+1)⊤ a_i − b_i)⊤ Γ_i^{-1}(X^(k))
                                       − (V_{a,i} X^(k+1) − V_{ab,i}) Γ_i^{-1}(X^(k)) e_i(X^(k)) e_i⊤(X^(k)) Γ_i^{-1}(X^(k)) ).
Note that F(X^(k+1), X^(k)) is linear in X^(k+1), so that X^(k+1) can be computed in closed form as a function of X^(k). Equation (LINRLX) with X^(k) fixed can be viewed as a linear relaxation of the first order optimality condition of (WTLS_X), which is a highly nonlinear equation.
An outline of the PR algorithm is given in Algorithm 3.2. In general, solving the equation (LINRLX) for X^(k+1) requires vectorization. The identity vec(AXB) = (B⊤ ⊗ A) vec(X) is used in order to transform (LINRLX) to the classical system of equations G(X^(k)) vec(X^(k+1)) = h(X^(k)), where G and h are given in the algorithm.
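The vectorization step can be checked numerically; the following snippet (Python/NumPy, column-major convention) verifies the identity that turns the linear matrix equation (LINRLX) into an ordinary system G vec(X) = h.

```python
import numpy as np

# Numerical check of vec(A X B) = (B^T kron A) vec(X); the identity holds
# under the column-major ("Fortran" order) vectorization convention.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 5))

lhs = (A @ X @ B).reshape(-1, order="F")            # vec(A X B)
rhs = np.kron(B.T, A) @ X.reshape(-1, order="F")    # (B^T kron A) vec(X)
assert np.allclose(lhs, rhs)
```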
Algorithm 3.2 Algorithm of Premoli and Rastello for the WTLS problem  wtlspr
Input: the data matrix D ∈ R^{d×N}, the weight matrices {W_i}_{i=1}^N, a complexity specification m for the WTLS approximation, and a convergence tolerance ε.
1: Initial approximation: compute a TLS approximation B_i/o(X̂_tls) of D in L^d_{m,0}, and let X^(0) := X̂_tls. (See Note 3.23.)
2: Define: D =: [a_1 ⋯ a_N; b_1 ⋯ b_N], where the a_i have m rows, the b_i have p rows, and p := d − m.
3: k := 0.
4: repeat
5:   Let G = 0_{mp×mp} and h = 0_{mp×1}.
6:   for i = 1, ..., N do
7:     e_i := X^(k)⊤ a_i − b_i.
8:     V_i := W_i^{-1}.
9:     M_i := ( [X^(k); −I]⊤ V_i [X^(k); −I] )^{-1}.
10:    y_i := M_i e_i.
11:    G = G + M_i ⊗ (a_i a_i⊤) − (y_i y_i⊤) ⊗ V_{a,i}.
12:    h = h + vec(a_i b_i⊤ M_i − V_{ab,i} y_i y_i⊤).
13:  end for
14:  Solve the system Gx = h and let X^(k+1) := vec^{-1}(x).
15:  k := k + 1.
16: until ‖X^(k) − X^(k−1)‖_F / ‖X^(k)‖_F < ε
Output: X̂_wtls := X^(k).
Note 3.24 (Relation to Gauss–Newton-type algorithms) Algorithm 3.2 is not a Gauss–Newton-type algorithm for solving the first order optimality condition, because the approximation F is not the first order truncated Taylor series of M'_wtls; it is a different linear approximation. This choice makes the derivation of the algorithm simpler but complicates the convergence analysis.

Note 3.25 (Convergence properties) Algorithm 3.2 is proven to be locally convergent with a superlinear convergence rate; see [MRP+05, Section 5.3]. Moreover, the convergence rate tends to quadratic as the approximation gets closer to a minimum point. The algorithm, however, is not globally convergent, and simulation results suggest that the region of convergence to a minimum point can be rather small, so that a good initial approximation is needed for convergence.
An Algorithm Based on Classical Local Optimization Methods

Both the alternating least squares and the PR algorithms are heuristic optimization methods. Next, we describe an algorithm for the WTLS problem based on classical local optimization methods. The classical local optimization methods have by now reached a high level of maturity. In particular, their convergence properties are well understood, while the convergence properties of the RiSVD, MLPCA, and PR methods are still not.
In order to apply a classical optimization algorithm for the solution of (WTLS), first we have to choose a parameterization of the model. Possible parameterizations are given by the kernel, image, and input/output representations. For reasons to be discussed later (see Note 3.26), we choose the input/output representation (I/Orepr), defined on page 34, so that the considered problem is (WTLS_X).
A quasi-Newton-type method requires an evaluation of the cost function M_wtls(X) and its first derivative M'_wtls(X). Both the misfit and its first derivative are available in closed form, so that their evaluation is a matter of numerical implementation of the involved operations. The computational steps are summarized in Algorithm 3.3. The proposed algorithm, based on a classical optimization method, is outlined in Algorithm 3.4.
Algorithm 3.3 WTLS cost function and first derivative evaluation  qncostderiv
Input: D ∈ R^{d×N}, {W_i}_{i=1}^N, m, and X.
1: Define: D =: [a_1 ⋯ a_N; b_1 ⋯ b_N], where the a_i have m rows, the b_i have p rows, and p := d − m.
2: Let f = 0_{1×1} and f' = 0_{m×p}.
3: for i = 1, ..., N do
4:   e_i := X⊤ a_i − b_i.
5:   V_i := W_i^{-1}.
6:   Solve the system ( [X; −I]⊤ V_i [X; −I] ) y_i = e_i.
7:   f = f + e_i⊤ y_i.
8:   f' = f' + a_i y_i⊤ − (V_{a,i} X − V_{ab,i}) y_i y_i⊤.
9: end for
Output: M_wtls(X) = f, M'_wtls(X) = 2f'.
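The following Python/NumPy sketch mirrors the steps of Algorithm 3.3 for a list of weight matrices W_i; it is an illustration of the closed-form cost and gradient formulas, not the implementation referred to in the text.

```python
import numpy as np

def wtls_cost_grad(D, W, m, X):
    """Evaluate the WTLS cost and its gradient for the I/O parameter X.

    D is d x N, W is a list of N positive definite d x d weight matrices,
    m is the number of inputs, X is m x p with p = d - m.
    """
    d, N = D.shape
    p = d - m
    A, B = D[:m, :], D[m:, :]                 # columns a_i of A and b_i of B
    R_T = np.vstack([X, -np.eye(p)])          # [X; -I], i.e., R^T with R = [X^T  -I]
    f = 0.0
    g = np.zeros((m, p))
    for i in range(N):
        a_i, b_i = A[:, i], B[:, i]
        e_i = X.T @ a_i - b_i                 # residual of the i-th outcome
        V_i = np.linalg.inv(W[i])
        Gamma_i = R_T.T @ V_i @ R_T           # [X; -I]^T V_i [X; -I]
        y_i = np.linalg.solve(Gamma_i, e_i)
        f += e_i @ y_i
        Va, Vab = V_i[:m, :m], V_i[:m, m:]
        g += np.outer(a_i, y_i) - (Va @ X - Vab) @ np.outer(y_i, y_i)
    return f, 2 * g
```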
Algorithm 3.4 Algorithm for WTLS based on classical local optimization  wtlsopt
Input: the data matrix D ∈ R^{d×N}, the weight matrices {W_i}_{i=1}^N, a complexity specification m for the WTLS model, and a convergence tolerance ε.
1: Initial approximation: compute a TLS approximation B_i/o(X̂_tls) of D in L^d_{m,0}, and let X^(0) := X̂_tls. (See Note 3.23.)
2: Execute a standard optimization algorithm, e.g., the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) quasi-Newton method, for the minimization of M_wtls over X with initial approximation X^(0) and with cost function and first derivative evaluation performed via Algorithm 3.3. Let X̂ be the approximation found by the optimization algorithm upon convergence.
Output: X̂_wtls = X̂.
The optimization problem (WTLS_X) is a nonlinear least squares problem; i.e.,

    M_wtls(X) = F⊤(X) F(X)

for a certain F : R^{m×p} → R^{Np}. Therefore, the use of special optimization methods such as the Levenberg–Marquardt method is preferable. The vector F(X), however, is computed numerically, so that the Jacobian J(X) := [∂F_i/∂x_j], where x = vec(X), cannot be found in closed form. A possible workaround for this problem is proposed in [GP96], where an approximation called quasi-Jacobian is used instead. The quasi-Jacobian can be evaluated in a similar way to the one for the gradient, which allows us to use the Levenberg–Marquardt method for the solution of the WTLS problem.
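As an illustration of the nonlinear least squares formulation, the sketch below stacks Cholesky-whitened residuals into a vector F(X) with M_wtls(X) = F⊤(X)F(X) and hands it to a generic Levenberg–Marquardt solver. The quasi-Jacobian of [GP96] is not reproduced here; the solver is left to difference F numerically, so this is only a baseline sketch, assuming SciPy's least_squares routine is available.

```python
import numpy as np
from scipy.optimize import least_squares

def wtls_residual_vector(x, D, W, m):
    """Stack F(X) such that M_wtls(X) = F(X)^T F(X).

    Each outcome contributes chol(Gamma_i)^{-1} e_i, a "whitened" residual;
    this is one natural choice of F, not necessarily the one used in the text.
    """
    d, N = D.shape
    p = d - m
    X = x.reshape(m, p)
    R_T = np.vstack([X, -np.eye(p)])                   # [X; -I]
    F = []
    for i in range(N):
        e_i = X.T @ D[:m, i] - D[m:, i]
        Gamma_i = R_T.T @ np.linalg.solve(W[i], R_T)   # R W_i^{-1} R^T
        C = np.linalg.cholesky(Gamma_i)
        F.append(np.linalg.solve(C, e_i))              # C^{-1} e_i
    return np.concatenate(F)

# usage sketch:
# res = least_squares(wtls_residual_vector, x0, args=(D, W, m), method="lm")
```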
Note 3.26 (Kernel vs. input/output representation) In Note 3.16 we comment that an optimal approximation B̂ might have no input/output representation (I/Orepr). In practice, even when such a representation exists, the parameter X̂ might be large, which might cause numerical problems because of ill conditioning. In this respect the use of a kernel or image representation is preferable over an input/output representation.
The input/output representation (I/Orepr), however, has the advantage that the parameter X is unique, while the parameters R and P in the kernel and image representations are not. The misfit M_wtls depends only on ker(R) and col span(P) and not on all elements of R and P. This makes the optimization problems min_R M_wtls(R) and min_P M_wtls(P) more difficult than (WTLS_X). Additional constraints such as RR⊤ = I and P⊤P = I have to be imposed, and the misfit and its derivative have to be derived as a function of a unique parameterization of ker(R) and col span(P). The mathematical structure appropriate to treat this type of problem is the Grassmann manifold. Classical optimization methods for optimization over a Grassmann manifold are proposed in [MMH03].
3.6 Simulation Examples

We outlined the following algorithms for the WTLS problem:

• MLPCA: the alternating least squares algorithm,
• PR: the algorithm of Premoli and Rastello,
• QN: the algorithm based on a quasi-Newton optimization method, and
• LM: the algorithm based on the Levenberg–Marquardt optimization method.

In this section, we compare the performance of the algorithms on a simulation example. The considered model class is L^4_{2,0} and the experiment has N = 25 outcomes. The data D ∈ R^{4×25}, used in the example, is simulated according to the WTLS EIV model with a true model B_i/o(X̄), where X̄ ∈ R^{2×2} is a random matrix, and with random positive definite weight matrices {W_i}_{i=1}^{25}.
The algorithms are compared on the basis of the

• achieved misfit M_wtls(D, B̂), where B̂ is the computed approximate model;
• number of iterations needed for convergence to the specified convergence tolerance;
• execution time, measured in seconds;
• relative error of approximation ‖X̄ − X̂‖_F / ‖X̄‖_F, where X̂ is such that B̂ = B_i/o(X̂);
• computational cost, measured in floating point operations (flops); and
• number of cost function evaluations.

The criteria that are not relevant for a method are marked with "–". The experiment is repeated 10 times with different noise realizations and the averaged results are shown in Table 3.2.

Table 3.2. Simulation results comparing four algorithms for the WTLS problem.

Method        MLPCA    PR      QN      LM      GTLS    TLS     WLS
misfit        0.4687   0.4687  0.4687  0.4687  0.7350  0.7320  1.0500
error         0.2595   0.2582  0.2581  0.2582  0.4673  0.4698  0.5912
# iter.       51       6       5       4       –       –       –
# fun. eval.  –        –       18      29      –       –       –
megaflops     54       0.06    0.11    0.15    0.010   0.005   0.010
time, sec     2.29     0.40    0.59    0.54    0.004   0.002   0.003
In addition to the four WTLS algorithms, the table shows the results for the GTLS, TLS, and weighted least squares (WLS) methods. The GTLS method is applied by taking W to be the average (1/N) Σ_{i=1}^N W_i of the weight matrices, and the WLS method uses only the information {W_{b,i}}_{i=1}^N, where

    W_i =: [ W_{a,i}   W_{ab,i} ;  W_{ba,i}   W_{b,i} ]   and   W_{b,i} ∈ R^{p×p}.
The results indicate that although (in this example) the four WTLS algorithms converge to the same local minimum of the cost function, their convergence properties and numerical efficiency are different. In order to achieve the same accuracy, the MLPCA algorithm needs more iterations than the other algorithms, and as a result its computational efficiency is lower. The large number of iterations needed for convergence of the MLPCA algorithm is due to its linear convergence rate. In contrast, the convergence rate of the other algorithms is superlinear. They need approximately the same number of iterations, and their execution times and numerical efficiency are similar. The PR algorithm is about two times more efficient than the standard local optimization algorithms, but it has no global convergence property, which is a serious drawback for most applications.
3.7 Conclusions

In this chapter we considered approximate modeling problems for linear static models. The most general problem formulation that we treated is the WTLS problem, where the distance from each of the outcomes to the model is measured by a weighted norm with a possibly different weight matrix for each outcome. The WTLS problem is motivated by the relative error TLS and EIV estimation problems, where the weighted misfit naturally occurs and the weight matrices have an interpretation; e.g., in the EIV estimation problem, the weight matrices are related to the covariance matrices of the measurement errors.
We showed the relations among the kernel, image, and input/output representations. Except for the nongeneric cases that occur in the input/output representation, one representation can be transformed into another one. Thus in misfit approximation problems the choice of the representation is a matter of convenience. Once computed, the approximate model B̂ can be transformed to any desired representation. We noted that the numerical algorithms proposed in the literature for the WTLS problem differ mainly because they use different model representations.
We showed how the TLS and GTLS problems are solved by an SVD. In linear algebra terms, these problems boil down to finding a closest rank-deficient matrix to a given matrix. The main tool for obtaining the solution is the matrix approximation lemma. The solution always exists, but in certain nongeneric cases it could be nonunique.
The numerical solution of the general WTLS problem was approached in two steps:
1. compute analytically the misfit and
2. solve numerically the resulting optimization problem.
We showed two different expressions for the WTLS misfit: one for a model given by a kernel representation and the other for a model given by an image representation. In the first case, the misfit computation problem is a weighted least norm problem, and in the second case, it is a WLS problem.
Apart from the model representation used, another way of obtaining different computational algorithms is the use of a different optimization method in the second step. We outlined three algorithms: alternating least squares, the algorithm of Premoli and Rastello, and an algorithm based on classical local optimization methods. The alternating least squares algorithm has a linear convergence rate, which makes it rather slow compared to the other two algorithms, whose convergence rate is superlinear. The algorithm of Premoli and Rastello is not globally convergent, while the other two algorithms have this property. For these reasons, the algorithm based on standard local optimization methods is recommended for solving WTLS problems.
The proposed Algorithm 3.4, based on local optimization methods, however, uses the input/output representation B_i/o(X) of the model. As already discussed, this representation is not general; there are cases when the data D is such that the optimal approximation B̂ has no input/output representation B_i/o(X̂). Such cases are nongeneric, but even when X̂ exists, it might be large, which might cause ill conditioning. A solution for this problem is to use a kernel or an image representation of the model. The choice of a kernel representation leads to the optimization methods presented in [MMH03].
Chapter 4
Structured Total Least Squares

The weighted total least squares problem generalizes the TLS problem by introducing weights in the misfit function. The structured total least squares problem generalizes the TLS problem by introducing a structure in the data matrix. The motivation for the special type of block-Hankel structure comes from system identification. The global total least squares problem is closely related to the structured total least squares problem with block-Hankel structured data matrix.
Section 4.1 gives an overview of the existing literature. Section 4.2 defines the type of structure we restrict ourselves to and derives an equivalent unconstrained optimization problem. The data matrix is partitioned into blocks, and each of the blocks is block-Toeplitz/Hankel structured, unstructured, or exact. In Section 4.3, the properties of the equivalent problem are established. The special structure of the equivalent problem enables us to improve the computational efficiency of the numerical solution methods. By exploiting the structure, the computational complexity of the algorithms (local optimization methods) per iteration is linear in the sample size.
4.1 Overview of the Literature

History of the Problem

The origin of the STLS problem dates back to the work of Aoki and Yue [AY70a], although the name "structured total least squares" did not appear until 23 years later in the literature [DM93]. Aoki and Yue consider a single-input single-output system identification problem, where both the input and the output are noisy (EIV setting), and derive a maximum likelihood solution. Under the normality assumption for the measurement errors, a maximum likelihood estimate turns out to be a solution of the STLS problem. Furthermore, Aoki and Yue approach the optimization problem in a similar way to the one we adopt: they use classical nonlinear least squares minimization methods for solving an equivalent unconstrained problem.
The STLS problem occurs frequently in signal processing applications. Cadzow [Cad88] and Bresler and Macovski [BM86] propose heuristic solution methods that turn out to be suboptimal [DM94, Section V] with respect to the STLS criterion. These methods, however, became popular because of their simplicity. For example, the method of Cadzow is an iterative method that alternates between unstructured low-rank approximation and structure enforcement, thereby requiring only SVD computations and manipulation of the matrix entries.
Abatzoglou, Mendel, and Harada [AMH91] are considered to be the first who formulated an STLS problem. They called their approach constrained total least squares and motivate the problem as an extension of the TLS method to matrices with structure. The solution approach adopted in [AMH91] is closely related to the one of Aoki and Yue. Again, an equivalent optimization problem is derived, but it is solved numerically via a Newton-type optimization method.
Shortly after the publication of the work on the constrained total least squares problem, De Moor lists many applications of the STLS problem and outlines a new framework for deriving analytical properties and numerical methods [DM93]. His approach is based on the Lagrange multipliers, and the basic result is an equivalent problem, called Riemannian singular value decomposition, that can be considered as a nonlinear extension of the classical SVD. As an outcome of the new problem formulation, an iterative solution method based on the inverse power iteration is proposed.
Another algorithm for solving the STLS problem (even with l_1 and l_∞ norms in the cost function), called structured total least norm, is proposed by Rosen, Park, and Glick [RPG96]. In contrast to the approaches of Aoki and Yue and Abatzoglou et al., Rosen et al. solve the problem in its original formulation. The constraint is linearized around the current iteration point, which results in a linearly constrained least squares problem. In the algorithm of [RPG96], the constraint is incorporated in the cost function by adding a multiple of its residual norm.
The weighted low-rank approximation framework of Manton, Mahony, and Hua [MMH03] has been extended in Schuermans, Lemmerling, and Van Huffel [SLV04, SLV05] to Hankel structured low-rank approximation problems. All problem formulations and solution methods cited above, except for the ones in the WLRA framework, aim at rank reduction of the data matrix C by one. A generalization of the algorithm of Rosen et al. to problems with rank reduction by more than one is proposed by Van Huffel, Park, and Rosen [VPR96]. It involves, however, Kronecker products that unnecessarily inflate the dimension of the involved matrices. The solution methods in the WLRA framework [SLV04, SLV05] are also computationally demanding.
When dealing with a general affine structure, the constrained total least squares, Riemannian singular value decomposition, and structured total least norm methods have cubic computational complexity in the number of measurements. Fast algorithms with linear computational complexity are proposed by Lemmerling, Mastronardi, and Van Huffel [LMV00] and Mastronardi, Lemmerling, and Van Huffel [MLV00] for special STLS problems with data matrix S(p) =: [A  b] that is Hankel or composed of a Hankel block A and an unstructured column b. They use the structured total least norm approach but recognize that a matrix appearing in the kernel subproblem of the algorithm has low displacement rank. This is exploited via the Schur algorithm.
Motivation for Our Work

The STLS solution methods outlined above point out the following issues:

• structure: the structure specification for the data matrix S(p) varies from general affine [AMH91, DM93, RPG96] to specific affine, such as Hankel/Toeplitz [LMV00], or a Hankel/Toeplitz block augmented with an unstructured column [MLV00];
• rank reduction: all methods, except for the ones of [VPR96, SLV04, SLV05], reduce the rank of the data matrix by d = 1;
• computational efficiency: the efficiency varies from cubic for the methods that use a general affine structure to linear for the efficient methods of [LMV00, MLV00] that use a Hankel/Toeplitz-type structure.

No efficient algorithms exist for problems with block-Hankel/Toeplitz structure and rank reduction d > 1. In addition, the proposed methods lack a numerically reliable and robust software implementation that would make possible their use in real-life applications. Due to the above reasons, the STLS methods, although attractive for theoretical studies and relevant for applications, did not become popular for solving real-life problems.
The motivation for our work is to make the STLS method practically useful by deriving algorithms that are general enough for various applications and computationally efficient for real-life examples. We complement the theoretical study by a software implementation.
4.2 The Structured Total Least Squares Problem

The STLS problem

    min_{p̂} ‖p − p̂‖   subject to   rank(S(p̂)) ≤ n      (STLS)

defined in Section 2.7 is a structured low-rank approximation problem. The function S : R^{n_p} → R^{m×(n+d)}, m > n, defines the structure of the data as follows: a matrix C ∈ R^{m×(n+d)} is said to have structure defined by S if there exists a p ∈ R^{n_p} such that C = S(p). The vector p is called a parameter vector of the structured matrix C.

    The aim of the STLS problem is to perturb as little as possible a given parameter vector p by a vector Δp, so that the perturbed structured matrix S(p + Δp) becomes rank deficient with rank at most n.

A kernel representation of the rank deficiency constraint rank(S(p̂)) ≤ n yields the equivalent problem

    min_{R, RR⊤ = I_d}  min_{p̂} ‖p − p̂‖   subject to   R S⊤(p̂) = 0.      (STLS_R)

In this chapter, we use an input/output representation, so that the considered STLS problem is defined as follows.
Problem 4.1 (STLS). Given a data vector p ∈ R^{n_p}, a structure specification S : R^{n_p} → R^{m×(n+d)}, and a rank specification n, solve the optimization problem

    X̂_stls = arg min_{X, Δp} ‖Δp‖   subject to   S(p − Δp) [X; −I_d] = 0.      (STLS_X)

Define the matrices

    X_ext := [X; −I]   and   [A  B] := C := S(p),   where A ∈ R^{m×n} and B ∈ R^{m×d},

and note that C X_ext = 0 is equivalent to the structured system of equations AX = B.
The STLS problem is said to be affine structured if the function S is affine, i.e.,

    S(p) = S_0 + Σ_{i=1}^{n_p} S_i p_i,   for all p ∈ R^{n_p} and for some S_i, i = 1, ..., n_p.      (4.1)

In an affine STLS problem, the constraint S(p − Δp) X_ext = 0 is bilinear in the decision variables X and Δp.
Lemma 4.2. Let S : R^{n_p} → R^{m×(n+d)} be an affine function. Then

    S(p − Δp) X_ext = 0   ⟺   G(X) Δp = r(X),

where

    G(X) := [ vec((S_1 X_ext)⊤)  ⋯  vec((S_{n_p} X_ext)⊤) ] ∈ R^{md×n_p},      (4.2)

and

    r(X) := vec( (S(p) X_ext)⊤ ) ∈ R^{md}.

Proof.

    S(p − Δp) X_ext = 0
    ⟺ Σ_{i=1}^{n_p} S_i Δp_i X_ext = S(p) X_ext
    ⟺ Σ_{i=1}^{n_p} vec((S_i X_ext)⊤) Δp_i = vec( (S(p) X_ext)⊤ )
    ⟺ G(X) Δp = r(X).

Using Lemma 4.2, we rewrite the affine STLS problem as follows:

    min_X ( min_{Δp} ‖Δp‖   subject to   G(X) Δp = r(X) ).      (4.3)

The inner minimization problem has an analytic solution, which allows us to derive an equivalent optimization problem.
Theorem 4.3 (Equivalent optimization problem for affine STLS). Assuming that n_p ≥ md, the affine STLS problem (4.3) is equivalent to

    min_X f_0(X),   where   f_0(X) := r⊤(X) Γ†(X) r(X)   and   Γ(X) := G(X) G⊤(X).      (4.4)

Proof. Under the assumption n_p ≥ md, the inner minimization problem of (4.3) is equivalent to a least norm problem. Its minimum point (as a function of X) is

    Δp*(X) = G⊤(X) ( G(X) G⊤(X) )† r(X),

so that

    f_0(X) = ( Δp*(X) )⊤ Δp*(X) = r⊤(X) ( G(X) G⊤(X) )† r(X) = r⊤(X) Γ†(X) r(X).
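A direct (structure-ignoring) evaluation of the equivalent cost function can be written down immediately from Lemma 4.2 and Theorem 4.3. The sketch below, in Python/NumPy, is an O(m^3) reference evaluation under the affine parameterization (4.1); the efficient structure-exploiting evaluation is the subject of Section 4.5.

```python
import numpy as np

def stls_cost(X, S0, S_list, p):
    """Evaluate f_0(X) = r(X)^T Gamma(X)^+ r(X) for an affine structure (4.1).

    S0 and S_list = [S_1, ..., S_np] are m x (n+d) basis matrices, p is the
    parameter vector, X is the n x d I/O parameter of the rank constraint.
    """
    n, d = X.shape
    X_ext = np.vstack([X, -np.eye(d)])                        # [X; -I_d]
    S_of_p = S0 + sum(pi * Si for pi, Si in zip(p, S_list))   # S(p), see (4.1)
    r = (S_of_p @ X_ext).reshape(-1)                          # vec((S(p) X_ext)^T)
    G = np.column_stack([(Si @ X_ext).reshape(-1) for Si in S_list])   # (4.2)
    Gamma = G @ G.T
    y = np.linalg.pinv(Gamma) @ r                             # Gamma^+ r
    return r @ y
```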
The significance of Theorem 4.3 is that the constraint and the decision variable Δp in problem (4.3) are eliminated. Typically, the number of elements nd in X is much smaller than the number of elements n_p in the correction Δp. Thus the reduction in the complexity is significant.
The equivalent optimization problem (4.4) is a nonlinear least squares problem, so that classical optimization methods can be used for its solution. The optimization methods require a cost function and first derivative evaluation. In order to evaluate the cost function f_0 for a given value of the argument X, we need to form the weight matrix Γ(X) and to solve the system of equations Γ(X) y(X) = r(X). This straightforward implementation requires O(m^3) floating point operations (flops). For large m (the applications that we aim at) this computational complexity becomes prohibitive.
It turns out, however, that for special affine structures S, the weight matrix Γ(X) has a block-Toeplitz and block-banded structure, which can be exploited for efficient cost function and first derivative evaluations. The set of structures of S for which we establish the special properties of Γ(X) is specified next.
Assumption 4.4 (Flexible structure specification). The structure specification S : R^{n_p} → R^{m×(n+d)} is such that for all p ∈ R^{n_p}, the data matrix S(p) =: C =: [A  B] is of the type S(p) = [C^1  ⋯  C^q], where C^l, for l = 1, ..., q, is block-Toeplitz, block-Hankel, unstructured, or exact, and all block-Toeplitz/Hankel structured blocks C^l have equal row dimension K of the blocks.

Assumption 4.4 says that S(p) is composed of blocks, each one of which is block-Toeplitz, block-Hankel, unstructured, or exact. A block C^l that is exact is not modified in the solution Ĉ := S(p − Δp̂), i.e., Ĉ^l = C^l. Assumption 4.4 is the essential structural assumption that we impose on the problem (STLS_X). As shown in Section 4.6, it is fairly general and covers many applications.
Example 4.5 Consider the following block-Toeplitz matrix:

    C = [  5   3   1
           6   4   2
           7   5   3
           8   6   4
           9   7   5
          10   8   6 ]

with row dimension of the block K = 2. Next, we specify the matrices S_i that define via (4.1) an affine function S, such that C = S(p) for a certain parameter vector p. Let == be the element-wise comparison operator

    (A==B) := C,   for all A, B ∈ R^{m×n},   where C_{ij} := 1 if A_{ij} = B_{ij}, and 0 otherwise.

Let E be the 6 × 3 matrix with all elements equal to 1 and define S_0 := 0_{6×3} and S_i := (C==iE), for i = 1, ..., 10. We have

    C = Σ_{i=1}^{10} S_i · i = S_0 + Σ_{i=1}^{10} S_i p_i =: S(p),   with   p = [1  2  ⋯  10]⊤.

The matrix C considered in the example is special; it allowed us to easily write down a corresponding affine function S. Clearly, with the constructed S, any 6 × 3 block-Toeplitz matrix C with row dimension of the block K = 2 can be written as C = S(p) for a certain p ∈ R^{10}.
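The construction of Example 4.5 is easy to reproduce numerically; the following snippet builds the matrices S_i with the element-wise comparison operator and checks that C = S(p).

```python
import numpy as np

# S_i := (C == i*E), E the all-ones matrix, and C = sum_i S_i * p_i with p = (1, ..., 10).
C = np.array([[ 5, 3, 1],
              [ 6, 4, 2],
              [ 7, 5, 3],
              [ 8, 6, 4],
              [ 9, 7, 5],
              [10, 8, 6]])

p = np.arange(1, 11)
S = [(C == i).astype(int) for i in p]     # basis matrices S_1, ..., S_10
S0 = np.zeros_like(C)

C_rebuilt = S0 + sum(int(pi) * Si for pi, Si in zip(p, S))
assert np.array_equal(C, C_rebuilt)
```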
We use the notation n_l for the number of block columns of the block C^l. For unstructured and exact blocks, n_l := 1.

4.3 Properties of the Weight Matrix Γ

For the evaluation of the cost function f_0 of the equivalent optimization problem (4.4), we have to solve the system of equations Γ(X) y(X) = r(X), where Γ(X) = G(X) G⊤(X) with G(X) ∈ R^{md×n_p}, and both m and n_p are large. In this section, we investigate the structure of the matrix Γ(X). In the notation, we occasionally drop the explicit dependence of r and Γ on X.
Theorem 4.6 (Structure of the weight matrix Γ). Consider the equivalent optimization problem (4.4) from Theorem 4.3. If, in addition to the assumptions of Theorem 4.3, the structure S is such that Assumption 4.4 holds, then the weight matrix Γ(X) has the block-banded and block-Toeplitz structure

    Γ(X) = [ Γ_0   Γ_1⊤  ⋯    Γ_s⊤
             Γ_1   Γ_0   Γ_1⊤   ⋱      ⋱
             ⋮      ⋱     ⋱     ⋱      ⋱    Γ_s⊤
             Γ_s    ⋱     ⋱     ⋱      ⋱    ⋮
                    ⋱     ⋱     ⋱      Γ_0  Γ_1⊤
                    Γ_s   ⋯     Γ_1    Γ_0       ]  ∈ R^{md×md},      (4.5)

where Γ_k ∈ R^{dK×dK}, for k = 0, 1, ..., s, and s = max_{l=1,...,q}(n_l − 1), where n_l is the number of block columns in the block C^l of the data matrix S(p).

The proof is developed in a series of lemmas. First, we reduce the original problem with multiple blocks C^l (see Assumption 4.4) to three independent problems: one for the unstructured case, one for the block-Hankel case, and one for the block-Toeplitz case.
Lemma 4.7. Consider a structure specification of the form

    S(p) = [ S^1(p^1)  ⋯  S^q(p^q) ],   p^l ∈ R^{n_{p,l}},   Σ_{l=1}^q n_{p,l} =: n_p,

where p =: col(p^1, ..., p^q) and S^l(p^l) := S^l_0 + Σ_{i=1}^{n_{p,l}} S^l_i p^l_i, for all p^l ∈ R^{n_{p,l}}, l = 1, ..., q. Then

    Γ(X) = Σ_{l=1}^q Γ^l(X),      (4.6)

where Γ^l := G^l (G^l)⊤,   G^l := [ vec((S^l_1 X^l_ext)⊤)  ⋯  vec((S^l_{n_{p,l}} X^l_ext)⊤) ],   and

    X_ext =: col(X^1_ext, ..., X^q_ext),   with X^l_ext ∈ R^{n_l×d},   Σ_{l=1}^q n_l = n + d.

Proof. The result is a refinement of Lemma 4.2. Let Δp =: col(Δp^1, ..., Δp^q), where Δp^l ∈ R^{n_{p,l}}, for l = 1, ..., q. We have

    S(p − Δp) X_ext = 0
    ⟺ Σ_{l=1}^q S^l(p^l − Δp^l) X^l_ext = 0
    ⟺ Σ_{l=1}^q Σ_{i=1}^{n_{p,l}} S^l_i Δp^l_i X^l_ext = S(p) X_ext
    ⟺ Σ_{l=1}^q G^l Δp^l = r(X)
    ⟺ [G^1 ⋯ G^q] Δp = r(X),   where the matrix on the left-hand side is G(X),

so that Γ = GG⊤ = Σ_{l=1}^q G^l (G^l)⊤ = Σ_{l=1}^q Γ^l.

Next, we establish the structure of Γ for an STLS problem with unstructured data matrix.
Lemma 4.8. Let

    S(p) := [ p_1              p_2              ⋯  p_{n+d}
              p_{n+d+1}        p_{n+d+2}        ⋯  p_{2(n+d)}
              ⋮                ⋮                    ⋮
              p_{(m−1)(n+d)+1} p_{(m−1)(n+d)+2} ⋯  p_{m(n+d)} ]  ∈ R^{m×(n+d)}.

Then

    Γ = I_m ⊗ (X_ext⊤ X_ext);      (4.7)

i.e., the matrix Γ has the structure (4.5) with s = 0 and Γ_0 = I_K ⊗ (X_ext⊤ X_ext).

Proof. We have

    S(p − Δp) X_ext = 0   ⟺   vec( X_ext⊤ S⊤(Δp) ) = vec( (S(p) X_ext)⊤ )
                          ⟺   (I_m ⊗ X_ext⊤) vec( S⊤(Δp) ) = r(X),

where the matrix I_m ⊗ X_ext⊤ is G(X) and vec(S⊤(Δp)) is Δp. Therefore, Γ = GG⊤ = (I_m ⊗ X_ext⊤)(I_m ⊗ X_ext⊤)⊤ = I_m ⊗ (X_ext⊤ X_ext).
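The formula (4.7) is easy to verify numerically; a short check in Python/NumPy:

```python
import numpy as np

# For an unstructured data matrix, G = I_m kron X_ext^T, hence
# Gamma = G G^T = I_m kron (X_ext^T X_ext), which is (4.7).
rng = np.random.default_rng(1)
m, n, d = 5, 3, 2
X = rng.standard_normal((n, d))
X_ext = np.vstack([X, -np.eye(d)])

G = np.kron(np.eye(m), X_ext.T)
Gamma = G @ G.T
assert np.allclose(Gamma, np.kron(np.eye(m), X_ext.T @ X_ext))
```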
Next, we establish the structure of Γ for an STLS problem with block-Hankel data matrix.
Lemma 4.9. Let

    S(p) := [ C_1   C_2     ⋯  C_n̄
              C_2   C_3     ⋯  C_{n̄+1}
              ⋮     ⋮           ⋮
              C_m̄   C_{m̄+1} ⋯  C_{m̄+n̄−1} ]  ∈ R^{m×(n+d)},   n̄ := (n+d)/L,   m̄ := m/K,

where the C_i are K × L unstructured blocks, parameterized by p^(i) ∈ R^{KL} as follows:

    C_i := [ p^(i)_1          p^(i)_2          ⋯  p^(i)_L
             p^(i)_{L+1}      p^(i)_{L+2}      ⋯  p^(i)_{2L}
             ⋮                ⋮                    ⋮
             p^(i)_{(K−1)L+1} p^(i)_{(K−1)L+2} ⋯  p^(i)_{KL} ]  ∈ R^{K×L}.

Define a partitioning of X_ext as X_ext⊤ =: [X_1 ⋯ X_n̄], where X_j ∈ R^{d×L}. Then Γ has the block-banded and block-Toeplitz structure (4.5) with s = n̄ − 1 and with

    Γ_k = Σ_{j=1}^{n̄−k} X̄_j X̄_{j+k}⊤,   where   X̄_k := I_K ⊗ X_k.      (4.8)

Proof. Define the residual R := S(p) X_ext and the partitioning R⊤ =: [R_1 ⋯ R_m̄], where R_i ∈ R^{d×K}. Let ΔC := S(Δp), with block entries ΔC_i. We have

    S(p − Δp) X_ext = 0   ⟺   S(Δp) X_ext = S(p) X_ext,

and, written out block row by block row, the left-hand side becomes

    [ X̄_1  X̄_2  ⋯  X̄_n̄                             ]  [ vec(ΔC_1⊤)        ]     [ vec(R_1⊤) ]
    [       X̄_1  X̄_2  ⋯  X̄_n̄                       ]  [ vec(ΔC_2⊤)        ]  =  [ vec(R_2⊤) ]
    [              ⋱     ⋱        ⋱                 ]  [  ⋮                ]     [  ⋮        ]
    [                    X̄_1  X̄_2  ⋯  X̄_n̄          ]  [ vec(ΔC_{m̄+n̄−1}⊤)  ]     [ vec(R_m̄⊤) ],

where the block-banded, block-Toeplitz matrix on the left is G(X), the stacked vector of vec(ΔC_i⊤)'s is Δp, and the right-hand side is r(X). Therefore, Γ = GG⊤ has the structure (4.5), with the Γ_k's given by (4.8).
The derivation of the matrix Γ for an STLS problem with block-Toeplitz data matrix is analogous to the one for an STLS problem with block-Hankel data matrix. We state the result in the next lemma.
Lemma 4.10. Let

    S(p) := [ C_n̄        C_{n̄−1}    ⋯  C_1
              C_{n̄+1}    C_n̄        ⋯  C_2
              ⋮           ⋮              ⋮
              C_{m̄+n̄−1}  C_{m̄+n̄−2}  ⋯  C_m̄ ]  ∈ R^{m×(n+d)},

with the blocks C_i defined as in Lemma 4.9. Then Γ has the block-banded and block-Toeplitz structure (4.5) with s = n̄ − 1 and

    Γ_k = Σ_{j=k+1}^{n̄} X̄_j X̄_{j−k}⊤.      (4.9)

Proof. Following the same derivation as in the proof of Lemma 4.9, we find that

    G = [ X̄_n̄  X̄_{n̄−1}  ⋯  X̄_1
               X̄_n̄      X̄_{n̄−1}  ⋯  X̄_1
                          ⋱         ⋱
                               X̄_n̄  X̄_{n̄−1}  ⋯  X̄_1 ].

Therefore, Γ = GG⊤ has the structure (4.5), with the Γ_k's given by (4.9).

Proof of Theorem 4.6. Lemmas 4.7–4.10 show that the weight matrix Γ for the original problem has the block-banded and block-Toeplitz structure (4.5) with s = max_{l=1,...,q}(n_l − 1), where n_l is the number of block columns in the lth block of the data matrix.
Apart from revealing the structure of Γ, the proof of Theorem 4.6 gives an algorithm for the construction of the blocks Γ_0, ..., Γ_s that define Γ:

    Γ_k = Σ_{l=1}^q Γ^l_k,   where   Γ^l_k = { Σ_{j=k+1}^{n_l} X̄^l_j (X̄^l_{j−k})⊤       if C^l is block-Toeplitz,
                                               Σ_{j=1}^{n_l−k} X̄^l_j (X̄^l_{j+k})⊤       if C^l is block-Hankel,
                                               δ_k · I_K ⊗ ((X^l_ext)⊤ X^l_ext)           if C^l is unstructured,
                                               0_{dK}                                      if C^l is exact,      (4.10)

where δ is the Kronecker delta function: δ_0 = 1 and δ_k = 0 for k ≠ 0.
Corollary 4.11 (Positive definiteness of the weight matrix Γ). Assume that the structure of S is given by Assumption 4.4 with the block C^q being block-Toeplitz, block-Hankel, or unstructured and having at least d columns. Then the matrix Γ(X) is positive definite for all X ∈ R^{n×d}.

Proof. We will show that Γ^q(X) > 0 for all X ∈ R^{n×d}. From (4.6), it follows that Γ has the same property. By the assumption col dim(C^q) ≥ d, it follows that X^q_ext = [∗; −I_d], where the ∗ denotes a block (possibly empty) depending on X. In the unstructured case, Γ^q = I_m ⊗ ((X^q_ext)⊤ X^q_ext); see (4.10). But rank((X^q_ext)⊤ X^q_ext) = d, so that Γ^q is nonsingular. In the block-Hankel/Toeplitz case, G^q is block-Toeplitz and block-banded; see Lemmas 4.9 and 4.10. One can verify by inspection that, independent of X, G^q(X) has full row rank due to its row echelon form. Then Γ^q = G^q (G^q)⊤ > 0.

The positive definiteness of Γ is studied in a statistical setting in [KMV05, Section 4], where more general conditions are given. The restriction of Assumption 4.4 that ensures Γ > 0 is fairly minor, so that in what follows we will consider STLS problems of this type and replace the pseudoinverse in (4.4) with the inverse.
In the next section, we give an interpretation of the result from a statistical point of view, and in Section 4.5, we consider in more detail the algorithmic side of the problem.
4.4 Stochastic Interpretation

Our work on the STLS problem has its origin in the field of estimation theory. Consider the EIV model

    AX ≈ B,   where   A = Ā + Ã,   B = B̄ + B̃,   and   Ā X̄ = B̄.      (EIV_X)

The data A and B is obtained from true values Ā and B̄ with measurement errors Ã and B̃ that are zero mean random matrices. Define the extended matrix C̃ := [Ã  B̃] and the vector c̃ := vec(C̃⊤) of the measurement errors. It is well known (see [VV91, Chapter 8]) that the TLS problem (TLS_X) provides a consistent estimator for the true value of the parameter X̄ in the model (EIV_X) if cov(c̃) = σ²I (and additional technical conditions are satisfied). If in addition to cov(c̃) = σ²I, c̃ is normally distributed, i.e., c̃ ∼ N(0, σ²I), then the solution X̂_tls of the TLS problem is the maximum likelihood estimate of X̄.
The model (EIV_X) is called a structured EIV model if the observed data C and the true value C̄ := [Ā  B̄] have a structure defined by a function S. Therefore,

    C = S(p)   and   C̄ = S(p̄),
where p̄ ∈ R^{n_p} is the true value of the parameter p. As a consequence, the matrix of measurement errors is also structured. Let S be affine and defined by (4.1). Then

    C̃ = Σ_{i=1}^{n_p} S_i p̃_i   and   p = p̄ + p̃,

where the random vector p̃ represents the measurement error on the structure parameter p̄. In [KMV05], it is proven that the STLS problem (STLS_X) provides a consistent estimator for the true value of the parameter X̄ if cov(p̃) = σ²I (and additional technical conditions are satisfied). If p̃ ∼ N(0, σ²I), then a solution X̂ of the STLS problem is a maximum likelihood estimate of X̄.
Let r̃(X) := vec( (S(p̃) X_ext)⊤ ) be the random part of the residual r.

    In the stochastic setting, the weight matrix Γ is, up to the scale factor σ², equal to the covariance matrix V_r̃ := cov(r̃).

Indeed, r̃ = G p̃, so that

    V_r̃ := E(r̃ r̃⊤) = G E(p̃ p̃⊤) G⊤ = σ² GG⊤ = σ² Γ.

Next, we show that the structure of Γ is in a one-to-one correspondence with the structure of V_c̃ := cov(c̃). Let Γ_ij ∈ R^{dK×dK} be the (i,j)th block of Γ and let V_{c,ij} ∈ R^{(n+d)K×(n+d)K} be the (i,j)th block of V_c̃. Define also the following partitionings of the vectors r̃ and c̃:

    r̃ =: col(r̃_1, ..., r̃_m̄),   r̃_i ∈ R^{dK},   and   c̃ =: col(c̃_1, ..., c̃_m̄),   c̃_i ∈ R^{(n+d)K},

where m̄ := m/K. Using r̃_i = X̄_ext c̃_i, where X̄_ext := I_K ⊗ X_ext⊤, we have

    Γ_ij = (1/σ²) E(r̃_i r̃_j⊤) = (1/σ²) X̄_ext E(c̃_i c̃_j⊤) X̄_ext⊤ = (1/σ²) X̄_ext V_{c,ij} X̄_ext⊤.      (4.11)

The one-to-one relation between the structures of Γ and V_c̃ allows us to relate the structural properties of Γ, established in Theorem 4.6, with statistical properties of the measurement errors. Define stationarity and s-dependence of a centered sequence of random vectors c̃ := (c̃_1, c̃_2, ...), c̃_i ∈ R^{(n+d)K}, as follows:

• c̃ is stationary if the covariance matrix V_c̃ is block-Toeplitz with block size (n+d)K × (n+d)K, and
• c̃ is s-dependent if the covariance matrix V_c̃ is block-banded with block size (n+d)K × (n+d)K and block bandwidth 2s + 1.

    The sequence of measurement errors c̃_1, ..., c̃_m̄ being stationary and s-dependent corresponds to Γ being block-Toeplitz and block-banded.
The statistical setting gives an insight into the relation between the structure of the weight matrix Γ and the structure of the data matrix C. It can be verified that the structure specification of Assumption 4.4 implies stationarity and s-dependence for c̃. This indicates an alternative (statistical) proof of Theorem 4.6.
The blocks of Γ are quadratic functions of X, Γ_ij(X) = X̄_ext W_{c,ij} X̄_ext⊤, where W_{c,ij} := V_{c,ij}/σ²; see (4.11). Moreover, by Theorem 4.6, we have that under Assumption 4.4, W_{c,ij} = W_{c,|i−j|}, for certain matrices W_{c,k}, k = 1, ..., m̄, and W_{c,ij} = 0 for |i − j| > s, where s is defined in Theorem 4.6. Therefore,

    Γ_k(X) = X̄_ext W_{c,k} X̄_ext⊤,   for k = 1, ..., s,   where   W_{c,k} := (1/σ²) V_{c,k}.

In (4.10) we show how the matrices {Γ_k}_{k=0}^s are determined from the structure specification of Assumption 4.4. Now we give the corresponding expressions for the matrices {W_{c,k}}_{k=0}^s:

    W_{c,k} := diag(W^1_k, ..., W^q_k),   where   W^l_k = { (J⊤_{n_l K})^{t_l k}   if C^l is block-Toeplitz,
                                                            (J_{n_l K})^{t_l k}    if C^l is block-Hankel,
                                                            δ_k I_{n_l K}          if C^l is unstructured,
                                                            0_{n_l K}              if C^l is exact,      (4.12)

where J_ν denotes the ν × ν shift matrix

    J_ν := [ 0 1 0 ⋯ 0
             0 0 1 ⋯ 0
             ⋮       ⋱ ⋮
             0 0 0 ⋯ 1
             0 0 0 ⋯ 0 ]  ∈ R^{ν×ν}.

In the computational algorithm described in Section 4.5, we also use the partitioning of the matrix Γ into blocks of size d × d. Let Γ_ij ∈ R^{d×d} be the (i,j)th block of Γ in this finer partitioning and let V_{c,ij} ∈ R^{(n+d)×(n+d)} be the corresponding (i,j)th block of V_c̃. Define the following partitionings of the vectors r̃ and c̃:

    r̃ =: col(r̃_1, ..., r̃_m),   r̃_i ∈ R^d,   and   c̃ =: col(c̃_1, ..., c̃_m),   c̃_i ∈ R^{n+d}.

Using r̃_i = X_ext⊤ c̃_i, we have

    Γ_ij = (1/σ²) E(r̃_i r̃_j⊤) = (1/σ²) X_ext⊤ E(c̃_i c̃_j⊤) X_ext = (1/σ²) X_ext⊤ V_{c,ij} X_ext =: X_ext⊤ W_{c,ij} X_ext.
4.5 Efficient Cost Function and First Derivative Evaluation

We consider an efficient numerical method for solving the STLS problem (STLS_X) by applying standard local optimization algorithms to the equivalent problem (4.4). With this approach, the main computational effort is in the cost function and its first derivative evaluation.
First, we describe the evaluation of the cost function: given X, compute f_0(X). For given X, and with the blocks {Γ_k}_{k=0}^s constructed according to (4.12), the weight matrix Γ(X) is specified. Then, from the solution of the system Γ(X) y_r(X) = r(X), the cost function is found as f_0(X) = r⊤(X) y_r(X).
The properties of Γ(X) can be exploited in the solution of the system Γ y_r = r. The subroutine MB02GD from the SLICOT library [VSV+04] exploits both the block-Toeplitz and the block-banded structure to compute a Cholesky factor of Γ in O((dK)²sm) flops. In combination with the LAPACK subroutine DPBTRS that solves banded triangular systems of equations, the cost function is evaluated in O(m) flops. Thus an algorithm for local optimization that uses only cost function evaluations has computational complexity O(m) flops per iteration, because the computations needed internally for the optimization algorithm do not depend on m.
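For illustration, a banded solve of Γ y_r = r can be sketched with general-purpose routines as follows (Python/SciPy); the SLICOT routine MB02GD additionally exploits the Toeplitz structure, which this sketch does not, but the cost of the solve is still linear in the number of block rows.

```python
import numpy as np
from scipy.linalg import solveh_banded

def solve_block_banded_toeplitz(Gammas, r):
    """Solve Gamma y = r, where Gamma has the structure (4.5).

    Gammas = [Gamma_0, ..., Gamma_s] are the dK x dK blocks; Gamma is assembled
    only in symmetric banded storage, so memory and flops grow linearly with
    the number of block rows.
    """
    b = Gammas[0].shape[0]                  # block size dK
    s = len(Gammas) - 1
    N = len(r) // b                         # number of block rows
    bw = (s + 1) * b - 1                    # scalar (lower) bandwidth
    ab = np.zeros((bw + 1, N * b))          # lower form: ab[i - j, j] = Gamma[i, j]
    for jb in range(N):                     # block column index
        for kb in range(min(s, N - 1 - jb) + 1):
            blk = Gammas[kb]                # block at block position (jb + kb, jb)
            for jj in range(b):
                for ii in range(b):
                    row, col = (jb + kb) * b + ii, jb * b + jj
                    if row >= col:          # keep the lower triangle only
                        ab[row - col, col] = blk[ii, jj]
    return solveh_banded(ab, r, lower=True)

# usage sketch: y_r = solve_block_banded_toeplitz(Gammas, r); f0 = r @ y_r
```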
Next, we describe the evaluation of the derivative. The derivative of the cost function f_0 is (see Appendix A.2)

    f_0'(X) = 2 Σ_{i,j=1}^m a_j r_i⊤(X) M_ij(X) − 2 Σ_{i,j=1}^m [I  0] W_{c,ij} [X; −I] N_ji(X),      (4.13)

where A⊤ =: [a_1 ⋯ a_m], with a_i ∈ R^n,

    M(X) := Γ^{-1}(X),   N(X) := Γ^{-1}(X) r(X) r⊤(X) Γ^{-1}(X),

and M_ij ∈ R^{d×d}, N_ij ∈ R^{d×d} are the (i,j)th blocks of M and N, respectively.
Consider the following two partitionings of y_r ∈ R^{md}:

    y_r =: col(y_{r,1}, ..., y_{r,m}),   y_{r,i} ∈ R^d,   and   y_r =: col(ȳ_{r,1}, ..., ȳ_{r,m̄}),   ȳ_{r,i} ∈ R^{dK},      (4.14)

where m̄ := m/K. The first sum in (4.13) becomes

    Σ_{i,j=1}^m a_j r_i⊤ M_ij = A⊤ Y_r,   where   Y_r⊤ := [y_{r,1} ⋯ y_{r,m}].      (4.15)

Define the sequence of matrices

    N_k := Σ_{i=1}^{m̄−k} ȳ_{r,i+k} ȳ_{r,i}⊤,   N_{−k} = N_k⊤,   k = 0, ..., s.

The second sum in (4.13) can be written as

    Σ_{i,j=1}^m [I  0] W_{c,ij} [X; −I] N_ji = Σ_{k=−s}^s Σ_{i,j=1}^K (W_{a,k,ij} X − W_{ab,k,ij}) N_{k,ij}⊤,

where W_{c,k,ij} ∈ R^{(n+d)×(n+d)} is the (i,j)th block of W_{c,k} ∈ R^{K(n+d)×K(n+d)}, W_{a,k,ij} ∈ R^{n×n} and W_{ab,k,ij} ∈ R^{n×d} are defined as blocks of W_{c,k,ij} as

    W_{c,k,ij} =: [ W_{a,k,ij}   W_{ab,k,ij} ;  W_{ba,k,ij}   W_{b,k,ij} ],

and N_{k,ij} ∈ R^{d×d} is the (i,j)th block of N_k ∈ R^{dK×dK}.
Thus the evaluation of the derivative f_0'(X) uses the solution of Γ y_r = r, already computed for the cost function evaluation, and additional operations of O(m) flops. The steps described above are summarized in Algorithms 4.1 and 4.2.
The structure of S(·) is specified by the integer K, the number of rows in a block of a block-Toeplitz/Hankel structured block C^l, and the array

    S ∈ ({T, H, U, E} × N × N)^q

that describes the structure of the blocks {C^l}_{l=1}^q. The lth element S_l of the array S specifies the block C^l by giving its type S_l(1), the number of columns n_l = S_l(2), and (if C^l is block-Hankel or block-Toeplitz) the column dimension t_l = S_l(3) of a block in C^l. Therefore, the input data for the STLS problem is the data matrix S(p) (alternatively the parameter vector p) and the structure specification K and S.
Algorithm 4.1 outlines the steps for the construction of the W_{c,k} matrices. It requires arithmetic operations only for indexing matrix-vector elements. The s + 1 matrices {W_{c,k}}_{k=0}^s are sparse. For the typical applications that we address, however, their dimension (n+d)K × (n+d)K is relatively small (compared to the row dimension m of the data matrix), so that we do not take into account their structure.
Algorithm 4.1 From structure specification K, S to {W_{c,k}}  decode_struct
Input: structure specification K, S.
1: Define s := max_{l=1,...,q}(n_l − 1), where n_l := S_l(2)/t_l for a block-Toeplitz/Hankel structured block C^l, and n_l := 1 otherwise.
2: for k = 0, 1, ..., s do
3:   for l = 1, ..., q do
4:     if S_l(1) == T then
5:       W^l_k = (J⊤_{n_l})^{t_l k}
6:     else if S_l(1) == H then
7:       W^l_k = (J_{n_l})^{t_l k}
8:     else if S_l(1) == U then
9:       W^l_k = δ_k I_{n_l}
10:    else
11:      W^l_k = 0_{n_l}
12:    end if
13:  end for
14:  W_{c,k} := diag(W^1_k, ..., W^q_k)
15: end for
Output: {W_{c,k}}_{k=0}^s.
Algorithm 4.2 specifies the steps needed for the cost function and its first derivative evaluation. The flops per step for Algorithm 4.2 are as follows:

    step 2:  (n + d)(n + 2d)dK^3
    step 3:  m(n + 1)d
    step 4:  msd^2K^2
    step 5:  md
    step 8:  msd^2K − s(s + 1)d^2K^2/2
    step 9:  mnd + (2s + 1)(nd + n + 1)dK^2

Thus in total O(md(sdK^2 + n) + n^2dK^3 + 3nd^2K^3 + 2d^3K^3 + 2snd^2K^2) flops are required for cost function and first derivative evaluation. Note that the flop counts depend on the structure through s.
Algorithm 4.2 STLS cost function and first derivative evaluation  cost
Input: A, B, X, {W_{c,k}}_{k=0}^s.
1: Γ_k = (I_K ⊗ X_ext⊤) W_{c,k} (I_K ⊗ X_ext), where X_ext := [X; −I], for k = 0, 1, ..., s.
2: r = vec( (AX − B)⊤ ).
3: Solve Γ y_r = r exploiting the block-banded and block-Toeplitz structure of Γ, e.g., by using the routines MB02GD from the SLICOT library and DPBTRS from the LAPACK library.
4: f_0 = r⊤ y_r.
5: If only the cost function evaluation is required, output f_0 and stop.
6: Define col(y_{r,1}, ..., y_{r,m}) := y_r, where y_{r,i} ∈ R^d; col(ȳ_{r,1}, ..., ȳ_{r,m̄}) := y_r, where ȳ_{r,i} ∈ R^{dK}, m̄ := m/K; and Y_r⊤ := [y_{r,1} ⋯ y_{r,m}].
7: N_k = Σ_{i=1}^{m̄−k} ȳ_{r,i+k} ȳ_{r,i}⊤, for k = 0, 1, ..., s.
8: f_0' = 2A⊤Y_r − 2 Σ_{k=−s}^s Σ_{i,j=1}^K (W_{a,k,ij} X − W_{ab,k,ij}) N_{k,ij}⊤, where W_{c,k,ij} ∈ R^{(n+d)×(n+d)} is the (i,j)th block of W_{c,k} ∈ R^{K(n+d)×K(n+d)}; W_{a,k,ij} ∈ R^{n×n} and W_{ab,k,ij} ∈ R^{n×d} are defined as blocks of W_{c,k,ij} as
       W_{c,k,ij} =: [ W_{a,k,ij}   W_{ab,k,ij} ;  W_{ba,k,ij}   W_{b,k,ij} ];
   and N_{k,ij} ∈ R^{d×d} is the (i,j)th block of N_k ∈ R^{dK×dK}.
Output: f_0, f_0'.
Using the computation of the cost function and its first derivative, we can apply the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) quasi-Newton method. Thus the overall algorithm for the computation of the STLS solution is Algorithm 4.3.
A more efficient alternative, however, is to apply a nonlinear least squares optimization algorithm, such as the Levenberg–Marquardt algorithm. Let Γ = U⊤U be the Cholesky factorization of Γ. Then f_0 = F⊤F, with F := U^{-⊤} r. (Note that the evaluation of F(X) is cheaper than that of f_0(X).) We do not know an analytic expression for the Jacobian matrix J(X) = [∂F_i/∂x_j], but instead we use the so-called pseudo-Jacobian J_+ proposed in [GP96]. The evaluation of J_+ can be done efficiently, using the approach described above for f_0'(X).
Moreover, by using the nonlinear least squares approach and the pseudo-Jacobian J_+, we have as a byproduct of the optimization algorithm an estimate of the covariance matrix
Algorithm 4.3 Algorithm for solving the STLS problem  stls
Input: the structure specification K, S and the matrices A and B.
1: Compute the matrices W_{c,k} via Algorithm 4.1.
2: Compute the TLS solution X^(0) of AX ≈ B by, e.g., the function MB02MD from the SLICOT library.
3: Execute a standard optimization algorithm, e.g., the BFGS quasi-Newton method, for the minimization of f_0 over X with initial approximation X^(0) and with cost function and first derivative evaluation performed via Algorithm 4.2.
Output: X̂, the approximation found by the optimization algorithm upon convergence.
Table 4.1. Standard approximation problems that are special cases of the STLS problem for particular structure specifications K, S.

Problem                                             Structure S                   K
Least squares (LS)                                  {[E n], [U d]}                1
Total least squares (TLS)                           {[U n+d]}                     1
Mixed least squares-total least squares (LS-TLS)    {[E n_1], [U n_2], [U d]}     1
Hankel low-rank approximation (HLRA)                {[H n+p m]}                   p
SISO deconvolution                                  {[T n], [U 1]}                1
SISO EIV system identification                      {[H n+1], [H n+1]}            1
V_x = cov(vec(X̂)). As shown in [PS01, Chapter 17.4.7, equations (17)–(35)],

    V_x ≈ ( J_+⊤(X̂) J_+(X̂) )^{-1}.

Using V_x, we can compute statistical confidence bounds for the estimate X̂.
The solution method outlined in Algorithm 4.3, using the Levenberg–Marquardt algorithm, is implemented in the C language. A description of the software package is given in Appendix B.2.
4.6 Simulation Examples

The approximation problems listed in Table 4.1 are special cases of the block-Toeplitz/Hankel STLS problem for particular choices of the structure specification K, S. If not given, the third element of S_l is by default equal to one.
Our goal is to show the flexibility of the STLS problem formulation (STLS_X) with a structure of Assumption 4.4. More realistic applications of the STLS package are described in Chapter 11, where real-life data sets for multi-input multi-output (MIMO) system identification are used. Special problems such as LS, TLS, and mixed LS-TLS should be solved by the corresponding special methods. Still, they serve as benchmarks for the STLS package.
We show simulation examples for the problems of Table 4.1. The data is a perturbed version of true data generated by a true model. True model and true data refer to the particular problem (see the description below) and are selected randomly. The perturbation is Gaussian noise with covariance matrix σ²I.
Table 4.2 shows the scaled computation time t, cost function value f_0(X̂) (i.e., the error of approximation), and relative error of estimation

    e := ‖X̄ − X̂‖_F / ‖X̄‖_F,   where X̄ is the parameter of the true model,

for the STLS package and for an alternative computational method, if there is one. The scaling is done by the smaller of the two values: the one achieved by the STLS package and the one achieved by the alternative method.
Next, we describe the simulation setup for the examples.
Least Squares

The LS problem AX ≈ B, where A ∈ R^{m×n} is exact and unstructured and B ∈ R^{m×d} is perturbed and unstructured, is solved as an STLS problem with structure specification S = {[E n], [U d]}. In the simulation example, the solution of the STLS package is checked by the MATLAB least squares solver \. In the example, m = 100, n = 5, d = 2, and σ = 0.1.
In the LS case, the STLS optimization algorithm converges in two iteration steps. The reason for this is that the second order approximation used in the algorithm is actually the exact LS cost function. Therefore, independent of the initial approximation, in one iteration step the algorithm finds the global optimum point. An additional iteration is needed in order to conclude that the computed approximation in the first step is optimal.
Total Least Squares

The TLS problem AX ≈ B, where the data matrix C := [A  B] ∈ R^{m×(n+d)} is perturbed and unstructured, is solved as an STLS problem with structure specification S = {[U n+d]}. In the simulation example, the solution of the STLS package is checked by the function tls.m that implements the SVD method for the computation of the TLS solution; see Theorem 3.14. In the example, m = 100, n = 5, d = 2, and σ = 0.1.
Table 4.2. Comparison of the STLS package and alternative methods on simulation examples. t: scaled execution time, f_0: scaled cost function value, e: scaled error of estimation. The scaling is the smaller of the values achieved by the methods.

                     STLS package                 Alternative method
Problem          t     f_0   e              t      f_0               e    function
LS               40    1     1.000000       1      1.000000000000    1    \
TLS              1     1     1.000000       2      1.000000000000    1    tls
LS-TLS           1     1     1.000000       5      1.000000000000    1    lstls
HLRA             1     1     1.000087       147    1.000000056132    1    faststln2
Deconvolution    1     1     1.000009       631    1.000000000002    1    faststln1
System ident.    1     1     1.000000       –      –                 –    –
In the TLS case, the STLS algorithm converges in one iteration step, because the
default initial approximation used in the STLS package is the TLS solution.
Mixed Least Squares-Total Least Squares

The mixed LS-TLS problem [VV91, Section 3.5] is defined as follows: AX ≈ B, where A = [A_e  A_p], A_e ∈ R^{m×n_1} is exact and unstructured, and A_p ∈ R^{m×n_2} and B ∈ R^{m×d} are perturbed and unstructured. This problem is solved as an STLS problem with structure specification S = {[E n_1], [U n_2], [U d]}. In [VV91] an SVD based method for the computation of the mixed LS-TLS solution is proposed. In the simulation example the solution of the STLS package is checked by a MATLAB implementation lstls.m of the exact mixed LS-TLS solution method. In the example, m = 100, n = 5, d = 2, n_1 = 1, and σ = 0.1.
Hankel Low-Rank Approximation

The Hankel low-rank approximation problem [DM93, Section 4.5], [SLV04] is defined as follows:

    min_{Δp} ‖Δp‖²_2   subject to   H(p − Δp) has a given rank n.      (4.16)

Here H is a mapping from the parameter space R^{n_p} to the set of m × (n + p) block-Hankel matrices, with block size p × m. If the rank constraint is expressed as H(p̂) [X; −I] = 0, where X ∈ R^{n×p} is an additional variable, then (4.16) becomes an STLS problem with K = p and S = {[H n+p m]}.
The Hankel low-rank approximation problem has a system theoretic meaning of approximate realization or (finite-time) model reduction; see Section 11.4. In the single-input single-output (SISO) case, i.e., when p = m = 1, the STLS package is checked by a MATLAB implementation faststln2 of the method of [LMV00]. In the example, the true parameter vector is p̄ = col(1, ..., 12) (to which corresponds x̄ = col(−1, 2)) and the given vector is p = p̄ + col(5, 0, ..., 0).
The computed solutions by the STLS package and faststln2 approximate the same locally optimal solution. In the example, however, the STLS package achieves a better approximation of the minimum point in 147 times less computation time. The huge difference in the execution times is due to the MATLAB implementation of faststln2: m-files that extensively use for loops are executed slowly in MATLAB (version 7.0).
Single-Input Single-Output Deconvolution

The convolution of the sequences (..., a_{−1}, a_0, a_1, ...) and (..., x_{−1}, x_0, x_1, ...) is the sequence (..., b_{−1}, b_0, b_1, ...) defined as follows:

    b_i = Σ_{j=−∞}^{∞} x_j a_{i−j}.      (4.17)

Assume that x_j = 0 for all j < 1 and for all j > n. Then (4.17) for i = 1, ..., m can be written as the following structured system of equations:

    [ a_0      a_{−1}   ⋯  a_{1−n}
      a_1      a_0      ⋯  a_{2−n}     [ x_1         [ b_1
      ⋮        ⋮            ⋮       ·    x_2      =     b_2
      a_{m−1}  a_{m−2}  ⋯  a_{m−n} ]     ⋮              ⋮
                                         x_n ]          b_m ],      (4.18)

i.e., Ax = b. Note that the matrix A is Toeplitz structured and is parameterized by the vector

    a = col(a_{1−n}, ..., a_{m−1}) ∈ R^{m+n−1}.

The aim of the deconvolution problem is to find x, given a and b. With exact data the problem boils down to solving the system of equations (4.18). By construction it has an exact solution. Moreover, the solution is unique whenever A is of full column rank, which can be translated to a condition on a (persistency of excitation).
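A small numerical illustration of the exact-data case (Python/SciPy): build the Toeplitz matrix of (4.18) from the parameter vector a and verify that x is recovered. The dimensions m = 200, n = 2 are the ones used in the simulation example below.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
m, n = 200, 2
a = rng.standard_normal(m + n - 1)     # a = col(a_{1-n}, ..., a_{m-1})
x_true = rng.standard_normal(n)

# Row i of A is (a_{i-1}, a_{i-2}, ..., a_{i-n}); with a indexed from a_{1-n},
# the first column of A is (a_0, ..., a_{m-1}) and the first row is (a_0, ..., a_{1-n}).
A = toeplitz(a[n - 1:], a[n - 1::-1])
b = A @ x_true                          # exact data

x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x_hat, x_true)
```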
The deconvolution problem is more realistic and more challenging when the data a, b is perturbed. We assume that m > n, so that the system of equations (4.18) is overdetermined. Because both a and b are perturbed and the A matrix is structured, the deconvolution problem is an STLS problem with the structure specification S = {[T n], [U 1]}. Moreover, under the assumption that the observations are obtained from true values with additive noise that is zero mean and normal, with covariance matrix a multiple of the identity, the STLS method provides a maximum likelihood estimate of the true values.
We compare the solution obtained by the STLS package with the solution obtained by the MATLAB implementation faststln1 of the method of [MLV00]. In the particular simulation example, m = 200, n = 2, and σ = 0.05. The STLS package computes a slightly more accurate approximation of a minimum point using 631 times less computation time. The difference in the execution time is again due to the MATLAB implementation of faststln1.
Single-Input Single-Output Errors-in-Variables System Identification

Consider the SISO linear time-invariant system described by the difference equation

    y_t + Σ_{τ=1}^n a_τ y_{t+τ} = Σ_{τ=0}^n b_τ u_{t+τ}      (4.19)

and define the parameter vector

    x := col(b_0, ..., b_n, a_0, ..., a_{n−1}) ∈ R^{2n+1}.

Given a set of input/output data (u_1, y_1), ..., (u_T, y_T) and an order specification n, we want to find the parameter x of a system that fits the data.
i
i
i
68 Chapter 4. Structured Total Least Squares
For the time horizon t = 1, ..., T, (4.19) can be written as the structured system of equations

    [ u_1  u_2      ⋯  u_{n+1}   y_1  y_2      ⋯  y_n             [ y_{n+1}
      u_2  u_3      ⋯  u_{n+2}   y_2  y_3      ⋯  y_{n+1}     x  =  y_{n+2}
      ⋮    ⋮            ⋮        ⋮    ⋮            ⋮                  ⋮
      u_m  u_{m+1}  ⋯  u_T       y_m  y_{m+1}  ⋯  y_{T−1} ]          y_T ],      (4.20)

where m := T − n. We assume that the time horizon is large enough to ensure m > 2n + 1. The system (4.20) is satisfied for exact data and a solution is the true value of the parameter x. Moreover, under an additional assumption on the input (persistency of excitation), the solution is unique.
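For illustration, the data matrix of (4.20) can be assembled from the input/output samples as follows (Python/SciPy); the helper below is a sketch written for this example, not part of the STLS package.

```python
import numpy as np
from scipy.linalg import hankel

def ident_data_matrix(u, y, n):
    """Assemble the structured system (4.20) from input/output data.

    Returns (Phi, rhs): Phi has rows (u_t, ..., u_{t+n}, y_t, ..., y_{t+n-1})
    for t = 1, ..., T - n, and rhs = (y_{n+1}, ..., y_T).
    """
    T = len(u)
    m = T - n
    Hu = hankel(u[:m], u[m - 1:])            # m x (n+1) Hankel matrix of u
    Hy = hankel(y[:m], y[m - 1:T - 1])       # m x n Hankel matrix of y
    return np.hstack([Hu, Hy]), y[n:]

# With exact data generated by a system of order n, the overdetermined system
# Phi x = rhs is consistent and a least squares solve recovers the parameter x.
```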
For perturbed data an approximate solution is sought, and the fact that the system of equations (4.20) is structured suggests the use of the STLS method. Again, under appropriate conditions on the data generating mechanism, an STLS solution provides a maximum likelihood estimator.
The structure arising in the SISO identification problem is

    S = {[H n+1], [H n+1]},

where n is the order of the system. Unfortunately, in this case we do not have an alternative method by which the result of the STLS package can be verified. In the simulation example we choose n = 3,

    ā = 0.151 · [1  0.9  0.49  0.145]⊤,   b̄ = [1  1.2  0.81  0.27]⊤,

T = 1000, u white noise with unit variance, and σ² = 0.1. From the compared LS, TLS, and STLS solutions, the relative error of estimation e is largest for the LS method and is smallest for the STLS method. (The numerical values are not shown in Table 4.2.) This relation of the estimation errors can be expected with high probability for large sample size (T → ∞) due to the statistical consistency of the TLS and STLS methods and the inconsistency of the LS method. In addition, the STLS method, being a maximum likelihood method, is statistically more efficient than the TLS method.
4.7 Conclusions

We considered an STLS problem with block-wise specified structure of the data matrix. Each of the blocks can be block-Toeplitz/Hankel structured, unstructured, or exact. It is shown that such a formulation is flexible and covers as special cases many previously studied structured and unstructured matrix approximation problems.
The proposed numerical solution method is based on an equivalent unconstrained optimization problem (4.4). We proved that our Assumption 4.4 about the structure of the data matrix implies that the weight matrix in the equivalent problem is block-Toeplitz and block-banded. These properties are used for cost function and first derivative evaluation with computational cost linear in the sample size.

    Our results show that a large variety of STLS problems can be solved efficiently with a single kernel computational tool: efficient Cholesky factorization of a block-Toeplitz and block-banded matrix.

The block-Toeplitz/Hankel structure is motivated by approximate modeling problems for MIMO linear time-invariant dynamical systems. For example, EIV system identification, approximate realization, and model reduction problems can be solved via the proposed STLS algorithm.
Useful extensions of the results are

1. weighted STLS problems with cost function Δp⊤ W Δp, where W > 0 is diagonal, and
2. regularized STLS problems, where the cost function is augmented with the regularization term vec⊤(X) Q vec(X).

These extensions are still computable in O(m) flops per iteration by our approach with small modifications [MV06]. For example, the weighted STLS problem leads to a weight matrix Γ that is no longer block-Toeplitz but is still block-banded with bandwidth independent of m. This property is sufficient for cost function and first derivative evaluation with computational complexity O(m).
Chapter 5
Bilinear
Errors-in-Variables Model
A bilinear EIV model is considered. It corresponds to an overdetermined set of linear
equations AXB = C, in which the data A, B, C is perturbed by errors. An ALS estimator
is constructed that converges to the true value of the parameter X as the number of rows
in A and the number of columns in B tend to innity.
The estimator is modied for an application in computer vision. A pair of corre-
sponding points in two images are related via a bilinear equation, in which a parameter is
the fundamental matrix. The fundamental matrix contains information about the relative
orientation of the two images, and its estimation is a central problem in two-view motion
analysis.
5.1 Introduction
In this section, we generalize the linear model
AX B (5.1)
to the bilinear in the measurements model
AXB C. (5.2)
An example where the bilinear model (5.2) occurs is the total production cost model.
Example 5.1 (Total production cost model) Assume that p production inputs (materials,
parts, labor, etc.) are combined to make n products. Let b
k
, k = 1, . . . , p, be the price per
unit of the kth production input and x
jk
, j = 1, . . . , n, k = 1, . . . , p, be the number of units
of the kth production input required to produce one unit of the jth product. The production
cost per unit of the jth product is the jth element of the vector
y = Xb, y R
n
.
Let a
j
, j = 1, . . . , n, be a required quantity to be produced of the jth product. The
total quantity needed of the kth production input is the kth element of the vector
z

= a

X, z R
p
.
71 i
i
i
i
72 Chapter 5. Bilinear Errors-in-Variables Model
The total production cost c is z

b = a

y, which gives a single measurement AXB = C


model
a

Xb = c.
Multiple measurements occur when a set of quantities a
(1)
, . . . , a
(N
1
)
, to be produced
of the n products, a set of prices per unit b
(1)
, . . . , b
(N
2
)
of the production inputs, and a set
of total costs c
il
, corresponding to all combinations of the required quantities and prices,
are given. Then the model is
_

_
a
(1)
.
.
.
a
(N
1
)
_

_
. .
A
X
_
b
(1)
b
(N
2
)

. .
B
=
_

_
c
11
c
1N
2
.
.
.
.
.
.
c
N
1
1
c
N
1
N
2
_

_
. .
C
.
Another example that involves the bilinear model (5.2) is the estimation of the funda-
mental matrix in two-view motion analysis. The fundamental matrix estimation problem is
treated in detail in the latter sections of this chapter.
The TLS method applied for (5.2) results in the following optimization problem:
min
X,A,B,C
_
_
_
A B C
_
_
2
F
s.t. (AA)X(B B) = C C. (5.3)
As mentioned in [Ful87], the TLS estimate

X
tls
(a global minimum point of (5.3)) is biased.
Moreover (5.3) is a nonconvex optimization problem, whose solution requires computa-
tionally demanding optimization methods that are not guaranteed to nd a global minimum
point.
We use instead an ALS estimator that is consistent and computationally cheap. A
strong assumption needed for the ALS estimator, however, is that the covariance structure of
the measurement noises is known exactly. In contrast, the TLSmethod needs the covariances
up to a scaling factor. In the fundamental matrix estimation problem, we derive a noise
variance estimation procedure that overcomes this deciency.
5.2 Adjusted Least Squares Estimation of a Bilinear
Model
In the model (5.2), A R
N
1
n
, B R
pq
, and C R
N
1
q
are observations and X
R
np
is a parameter of interest. We assume that the observations are noisy measurements
of true values

A,

B, and

C, i.e.,
A =

A+

A, B =

B +

B, C =

C +

C. (5.4)
The true values

A,

B, and

C of the observations are assumed to satisfy the bilinear model

A

X

B =

C for some true value

X of the parameter. Fromthe point of viewof EIVmodeling,

C represents the equation error, while



A and

B represent the measurement errors.
Looking for asymptotic results in the estimation of X, we x the dimensions of X
n and pand let the number of measurements, m and q, increase. The measurements i
i
i
i
5.2. Adjusted least squares estimation of a bilinear model 73
are represented by the rows of A, the columns of B, and the elements of C. Dene the
covariance matrices
V

A
:= E

A


A, V

B
:= E

B

B

.
The assumptions are enumerated with Roman numerals.
(i) The elements of

A,

B, and

C are centered random variables with nite second order
moments. The elements of any one of the matrices

A,

B,

C are independent of the
elements of the other two matrices. The covariance matrices V

A
and V

B
are known.
Consider rst the LS cost function
Q
ls
(X; A, B, C) := |AXB C|
2
F
. (5.5)
In the space of matrices R
np
, we introduce a scalar product T, S) := trace(TS

). The
derivative Q
ls
/X is a linear functional on R
np
:
1
2
Q
ls
X
(H) = trace
_
(AXB C)(AHB)

_
= trace
_
A

(AXB C)B

_
= A

(AXB C)B

, H). (5.6)
We identify the derivative with the matrix that represents it in (5.6), so that we redene
1
2
Q
ls
X
:= A

(AXB C)B

.
The LS estimator

X
ls
is the solution of the optimization problem
min
X
Q
ls
(X; A, B, C)
or, equivalently, the solution of the score equation

ls
(X; A, B, C) := Q
ls
/X = (A

A)X(BB

) A

CB

= 0. (5.7)
For the estimator, we can take

X
ls
:= (A

A)

CB

(BB

,
which satises (5.7) if A

A and BB

are nonsingular. In the absence of measurement


errors, i.e., when

A = 0 and

B = 0, the LS estimator is consistent.
If

A = 0, the partial least squares estimator

X
pa
:= TLS solution of XB = (A

A)

C (5.8)
is consistent. Similarly, if

B = 0, the estimator

X
pb
:= TLS solution of AX = CB

(BB

(5.9)
is consistent. The partial least squares estimators

X
pa
and

X
pb
are inconsistent when both
A and B are noisy. i
i
i
i
74 Chapter 5. Bilinear Errors-in-Variables Model
Next, we are looking for a corrected score function
als
, such that
E[
als
(X;

A+

A,

B +

B, C) [ C ] =
ls
(X;

A,

B, C), for all X,

A,

B, and C.
The ALS estimator

X
als
is dened from the equation

als
(X; A, B, C) = 0. (5.10)
In order to solve (5.2), we look for a correction applied on the LS score func-
tion
ls
, such that
als
=
ls
. By assumption (i),
E[
ls
(X;

A+

A,

B +

B, C) [ C ]
=
ls
(X;

A,

B, C) +E

A


AX

B

B

+E

A


AX

B

B

+V

A
XV

B
=
ls
+
1
(

B) +
2
(

A) +V

A
XV

B
,
where

1
(

B) := V

A
X

B

B

and
2
(

A) :=

A


AXV

B
.
To nd a proper correction term , consider
E
1
(

B +

B) = V

A
X

B

B

+V

A
XV

B
(5.11)
and
E
2
(

A+

A) =

A


AXV

B
+V

A
XV

B
. (5.12)
Then
(A, B) =
1
(B) +
2
(A) V

A
XV

B
and

als
(X; A, B, C)
= (A

A)X(BB

) A

CB

A
X(BB

) (A

A)XV

B
+V

A
XV

B
= (A

AV

A
)X(BB

B
) A

CB

.
As an estimator we can take

X
als
:= (A

AV

A
)

(A

CB

)(BB

B
)

. (5.13)
If A

AV

A
and BB

B
are nonsingular, then (5.13) satises (5.10). These matrices
are nonsingular with probability tending to one as the number of measurements tend to
innity. i
i
i
i
5.3. Properties of the adjusted least squares estimator 75
5.3 Properties of the Adjusted Least Squares
Estimator
We introduce further assumptions.
(ii) The rows of

A are independent, the columns of

B are independent, and all elements
of

C are independent.
(iii) E a
4
ij
const, E

b
4
kl
const, and E c
2
il
const.
(iv) With V
A
:=

A


A and V
B
:=

B

B

max
(V
A
) +m

2
min
(V
A
)
0 as N
1
and

max
(V
B
) +q

2
min
(V
B
)
0 as N
2
.
Assumption (iv) corresponds to the condition of weak consistency, given in [Gal82]
for the maximum likelihood estimator in the linear model (5.1).
Theorem 5.2 (Weak consistency). Under assumptions (i) to (iv), the ALS estimator

X
als
converges to

X in probability as N
1
and N
2
.
Proof. See [KMV03, Theorem 1].
Under more restrictive assumptions than (iii) and (iv), the ALS estimator is strongly
consistent.
(iii) E[ a
ij
[
2r
const, E[

b
kl
[
2r
const, and E[ c
il
[
2r
const for a xed real number
r 2.
(iv) For a xed N

1
1,

N
1
=N

1
_
N
r/2
1

r
min
(V
A
)
+

r
max
(V
A
)

2r
min
(V
A
)
_
< ,
and for a xed N

2
1,

N
2
=N

2
_
N
r/2
2

r
min
(V
B
)
+

r
max
(V
B
)

2r
min
(V
B
)
_
< ,
where r is dened in assumption (iii).
Theorem 5.3 (Strong consistency). Under assumptions (i), (ii), (iii), and (iv), the ALS
estimator

X
als
converges to

X almost surely as N
1
and N
2
.
Proof. See [KMV03, Theorem 2].
In [KMV03, Section 5], we prove the following rate of convergence:
|

X
als


X|
F
=
_

N
1
+
_

max
(V
A
)

min
(V
A
)
+

N
2
+
_

max
(V
B
)

min
(V
B
)
_
O
p
(1) . i
i
i
i
76 Chapter 5. Bilinear Errors-in-Variables Model
Under additional assumptions, the ALS estimator is asymptotically normal; see [KMV03,
Section 6]. It turns out that the asymptotic covariance matrix of the estimator does not
depend upon the covariance structure of

C.
In [KMV03, Section 7] a heuristic small sample modication of the ALS estimator
is proposed. The approach is similar to the one of [CST00]. The modied ALS estimator
has the same asymptotic properties as the ALS estimator but improves the results for small
sample size.
5.4 Simulation Examples
In this section, we apply the ALS estimator to a hypothetical example. Consider the bilinear
model (5.2) with N
1
= N
2
= N and n = p = 2, i.e.,
A
..
N2
X
..
22
B
..
2N
= C
..
NN
.
The true data is

A =
_

_
I
2
.
.
.
I
2
_

_,

B =
_
I
2
I
2

, and

C =
_

_
I
2
I
2
.
.
.
.
.
.
I
2
I
2
_

_,
so that the true value of the parameter is

X = I
2
. The perturbations

A,

B, and

C are
selected in three different ways.
Equally sized errors. All errors a
ij
,

b
kl
, c
il
are independent, centered, and normally
distributed with common variance 0.01.
Differently sized errors. All errors a
ij
,

b
kl
, c
il
are independent, centered, and normally
distributed. The elements in the rst column of

Ahave variance 0.05 and the elements
in the second column of

A have variance 0.01. The elements in the rst row of

B
have variance 0.01 and the elements in the second row of

B have variance 0.05. All
elements of

C have variance 0.01.
Correlated errors. All errors a
ij
,

b
kl
, c
il
are independent and normally distributed.
All rows of

A have covariance 0.01 [
5 1
1 1
] and the elements are independent from row
to row. All columns of

B have covariance 0.01 [
1 1
1 5
] and the elements are independent
from column to column.
The estimation is performed for increasing number of measurements N. As a measure
of the estimation quality, we use the empirical relative mean square error
e(N) =
1
K
K

s=1
|

X

X
(s)
|
2
F
|

X|
2
F
,
where

X
(s)
is the estimate computed for the sth noise realization.
The following estimators are compared: i
i
i
i
5.4. Simulation examples 77

X
als
the ALS estimator,

X
m
the small sample modied ALS estimator [KMV03, Section 7],

X
ls
the LS estimator, and

X
pa
and

X
pb
the partial least squares estimators (5.8) and (5.9).
Figure 5.1 shows the relative mean square error e for N ranging from 20 to 100. The
consistency properties of the ALS estimators and the bias of the other estimators is most
clearlyseeninthe experiment withcorrelatederrors. Note that the small sample modication
indeed improves the relative mean square error and for large sample size converges to the
original ALS estimator.
20 40 60 80 100
0
1
2
3
4
5
6
x 10
3
Equally sized errors

X
ls

X
pa

X
pb

X
m

X
als
N
e
(
N
)
20 40 60 80 100
0
0.005
0.01
0.015
0.02

X
ls

X
pa
X
pb

X
m

X
als
N
e
(
N
)
Differently sized errors
20 40 60 80 100
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04

X
ls

X
pa

X
pb

X
m
N
e
(
N
)
Correlated errors

X
als
Figure 5.1. Relative mean square error e as a function of N. i
i
i
i
78 Chapter 5. Bilinear Errors-in-Variables Model
5.5 Fundamental Matrix Estimation
Suppose that two images are captured by a mobile camera and N matching pairs of pixels
are located. Let
u
(i)
= col(u
(i)
1
, u
(i)
2
, 1) and v
(i)
= col(v
(i)
1
, v
(i)
2
, 1), for i = 1, . . . , N,
be the homogeneous pixel coordinates in the rst and second images, respectively. The
so-called epipolar constraint
v
(i)
Fu
(i)
= 0, i = 1, . . . , N, (5.14)
relates the corresponding matching pixels, where F R
33
, rank(F) = 2 is the funda-
mental matrix. Estimation of F from the given data v
(i)
, u
(i)
, i = 1, . . . , N, is called
structure from motion problem and is a central problem in computer vision. We adapt the
ALS estimator (5.13) to the fundamental matrix estimation problem.
The fundamental matrix estimation problemis not a special case of the bilinear model
considered in Sections 5.15.4. Note that with
A :=
_
v
(1)
v
(N)

, B :=
_
u
(1)
u
(N)

, and C := 0,
(5.14) represents only the diagonal elements of the equation AFB = C. Moreover, the C
matrix is noise-free, and the parameter F is of rank two and normalized. Thus the ALS
estimator derived for the bilinear model (5.2) cannot be used directly for the estimation of
the fundamental matrix F.
As in Section 5.1, we assume that the given points are noisy observations
u
(i)
= u
(i)
+ u
(i)
, and v
(i)
= v
(i)
+ v
(i)
, for i = 1, . . . , N (5.15)
of true values u
(i)
and v
(i)
that satisfy the model
v
(i)

F u
(i)
= 0, for i = 1, . . . , N (5.16)
for some true value

F R
33
, rank(

F) = 2 of the parameter F. In addition, we assume
that

F is normalized by |

F|
F
= 1.
In the absence of noise, equations (5.14) have an exact solution

F. The so-called
eight-point algorithm [Har97] computes it from N = 8 given pairs of points. A lot of
techniques have been proposed to improve the accuracy of the eight-point algorithm in the
presence of noise [TM97, MM98, LM00]. From a statistical point of view, however, the
corresponding estimators are inconsistent. We review in more detail the estimator proposed
in [MM98].
In [MM98], the model (5.14) is transformed into the form
_
u
(i)
v
(i)
. .
a
(i)
_

vec(F)
. .
f
= 0, for i = 1, . . . , N. (5.17)
Dening the matrix A :=
_
a
(1)
a
(N)

, (5.17) becomes the system of equations


Af = 0. The normalization condition |F|
F
= 1 for F implies the normalization condi-
tion |f| = 1 for f. With noisy data, an estimate of f can be computed by solving the
optimization problem
min
f
|Af|
2
2
subject to |f| = 1, (5.18) i
i
i
i
5.5. Fundamental matrix estimation 79
which is a quadratically constrained least squares problem, so that its global solution

f
ls,1
can be computed from the SVD of A. The corresponding estimate

F
ls,1
is constructed
from

f
ls,1
in an obvious way. We denote this construction by

F
ls,1
:= vec
1
(

f
ls,1
). Finally,
the rank constraint is enforced by approximating

F
ls,1
with the nearest (in Frobenius norm)
rank-decient matrix

F
ls
. This requires another SVD.
For statistical analysis of

F
ls
, the vectors a
(i)
are interpreted as observations
a
(i)
= u
(i)
v
(i)
+ a
(i)
. (5.19)
The estimator

F
ls
is consistent under the assumption that the errors a
(1)
, . . . , a
(N)
are zero
mean independent and identically distributed (i.i.d.) random vectors. Such an assumption,
however, is not satised for the EIV model (5.15). Suppose that the measurements u
(i)
and v
(i)
are obtained with additive errors v
(i)
and v
(i)
that are independent zero mean i.i.d.
normal random vectors. Then the vectors a
(i)
are not normally distributed because their
elements involve the product of two coordinates. It can be shown that
E a
(i)
a
(i)
=
_
E u
(i)
u
(i)
_

_
v
(i)
v
(i)
_
+
_
u
(i)
u
(i)
_

_
E v
(i)
v
(i)
_
+
_
E v
(i)
v
(i)
_

_
E u
(i)
u
(i)
_
.
In [MM98], the estimator

F
ls
is called the TLS estimator. In fact,

F
ls
is the TLS
estimator for the transformed model Af = 0, |f| = 1. The TLS estimator for the original
problem (5.14) is (compare with (5.3))
min
F
u
(1)
,...,u
(N)
v
(1)
,...,v
(N)
N

i=1
_
|u
(i)
|
2
+|v
(i)
|
2
_
subject to (u
(i)
+ u
(i)
)

F(v
(i)
+ v
(i)
) = 0
for i = 1, . . . , N,
(5.20)
which is a different problem. It is a nonconvex optimization problem. Moreover, as noted
in Section 5.1, the TLS estimator

X
tls
(a global minimum point of (5.20)) is also biased. We
refer to

F
ls
as the LS estimator because in terms of the bilinear model (5.14) it minimizes
the equation error Af.
We make the following assumptions on the errors u
(i)
and v
(i)
in (5.15).
(i) u
(i)
and v
(i)
, for i = 1, . . . , N, are zero mean independent random variables.
(ii) cov( u
(i)
) = cov( v
(i)
) =
2
diag(1, 1, 0), for i = 1, . . . , N and certain > 0.
Let u
(i)
:= col( u
(i)
1
, u
(i)
2
, u
(i)
3
). Assumption (ii) means that the components of u
(i)
are
uncorrelated, u
(i)
3
is noise-free, and var
_
u
(i)
1
_
= var
_
u
(i)
2
_
=
2
. The same holds for v
(i)
.
These are more natural assumptions for the application at hand than the assumptions on a
(i)
needed for consistency of the LS estimator. i
i
i
i
80 Chapter 5. Bilinear Errors-in-Variables Model
5.6 Adjusted Least Squares Estimation of the
Fundamental Matrix
The LS cost function is
Q
ls
(F) :=
N

i=1
q
ls
(F; u
(i)
, v
(i)
), where q
ls
(F; u, v) :=
_
v

Fu
_
2
.
Next, we construct an adjusted cost function Q
als
(F) that leads to a consistent estimator. It
is dened by the identity
EQ
als
(F) = Q
ls
(F) for all F R
33
and u
(i)
, v
(i)
R
3
.
By assumption (i),
Q
als
(F) =
N

i=1
q
als
(F; u
(i)
, v
(i)
),
where q
als
satises the identity
Eq
als
(F; u + u, v + v) = q
ls
(F; u, v), for all F R
33
and u, v R
3
,
and u N(0, V ), v N(0, V ) independent, with V :=
2
diag(1, 1, 0).
The solution of (5.6) is
q
als
(F, u, v) := trace
_
_
vv

V
_
F
_
uu

V
_
F

_
. (5.21)
Indeed,
Eq
als
(F; u + u, v + v)
= Etrace
_
_
( v + v)( v + v)

V
_
F
_
( u + u)

( u + u)

V
_
F

_
= Etrace
_
_
v v

+ 2 v v

+ ( v v

V )
_
F
_
u u

+ 2 u u

+ ( u u

V )
_
F

_
.
After expanding the right-hand side and applying the expectation operator to the summands,
assumptions (i) and (ii) imply that all summands except for the rst one are equal to zero.
Thus
Eq
als
(F, u + u, v + v) = trace
_
_
v v

_
F
_
u u

_
F

_
.
But
trace
_
_
v v

_
F
_
u u

_
F

_
=
_
u

v
__
v

F u
_
=
_
v

F u
_
2
= q
ls
(F, u, v).
Then the solution of (5.6) is given by
Q
als
(F) = trace
_
N

i=1
_
v
(i)
v
(i)
V
_
F
_
u
(i)
u
(i)
V
_
F

_
. i
i
i
i
5.7. Properties of the fundamental matrix estimator

81
With f := vec(F),
Q
als
(F) = f

_
N

i=1
_
u
(i)
u
(i)
V
_

_
v
(i)
v
(i)
V
_
_
f.
Denote
S
N
:=
N

i=1
_
u
(i)
u
(i)
V
_

_
v
(i)
v
(i)
V
_
(5.22)
and let

f
als,1
arg min f

S
N
f subject to |f| = 1. (5.23)
The matrix S
N
is symmetric, so that the ALS estimator

f
als,1
is a normalized eigenvector
of S
N
associated with the smallest eigenvalue of S
N
.
Let

F
als,1
:= vec
1
(

f
als,1
). If rank(

F
als,1
) = 3, it is approximated by a rank-decient
matrix. Let

F
als,1
= UV

, where = diag(
1
,
2
,
3
) and U, V R
33
are orthogonal
matrices, be an SVD of

F
als,1
. The ALS estimator on the second stage is dened as

F
als
:= U diag(
1
,
2
, 0)V

,
i.e., the best low-rank approximation of

F
als,1
, according to the EckartYoungMirsky
theorem [EY36].
5.7 Properties of the Fundamental Matrix Estimator

Consistency of the estimator



F
als,1
implies consistency of the estimator

F
als
. Indeed, suppose
that |

F
als,1


F|
F
. Because rank(

F) = 2, for the estimator

F
als
on the second stage,
we have
|

F
als,1


F
als
|
F
|

F
als,1


F|
F
. (5.24)
Then
|

F
als


F|
F
|

F
als


F
als,1
|
F
+|

F
als,1


F|
F
2.
Note that the matrix

F also satises (5.16), and |



F|
F
= |

F|
F
= 1. Therefore we
estimate

F up to a sign.
Introduce the matrix
F
N
:=
1
N
N

i=1
_
u
(i)
u
(i)
_

_
v
(i)
v
(i)
_
. (5.25)
For the vector

f := vec(

F), we have (see (5.16))

F
N

f =
1
N
N

i=1
trace( v
(i)
v
(i)

F u
(i)
u
(i)

F

) = 0,
and F
N
0. Thus
min
(F
N
) = 0. We require that there exists N

such that rank(F


N
) =
8 for N N

. Moreover, we need a stronger assumption.


Let
1
(F
N
)
2
(F
N
)
9
(F
N
) = 0 be the eigenvalues of F
N
. i
i
i
i
82 Chapter 5. Bilinear Errors-in-Variables Model
(iii) There exist N

1 and c
0
> 0, such that for all N N

,
8
(F
N
) c
0
.
The minimization problem (5.23) could have a nonunique solution, but due to as-
sumption (iii), for N > N

() the smallest eigenvalue of S


N
will be unique and then the
estimator

f
als,1
will be uniquely dened, up to a sign.
The next assumptions are needed for the convergence
1
N
S
N
F
N
0 almost surely as N . (5.26)
(iv)
1
N

N
i=1
| u
(i)
|
4
const and
1
N

N
i=1
| v
(i)
|
4
const.
(v) For xed > 0, E| u
(i)
|
4+
const and E| v
(i)
|
4+
const.
For two matrices A and B of the same size, dene the distance between A and B as
the Frobenius norm of their difference, i.e.,
dist(A, B) := |AB|
F
.
Theorem 5.4 (Strong consistency). Under assumptions (i) to (v),
dist(

F
als
,

F, +

F) 0 almost surely as N . (5.27)
Proof. See [KMV02, Theorem 1].
The computation of the ALS estimator needs knowledge of the noise variance
2
.
When
2
is unknown, it can be estimated as follows:

2
= arg min

min
_
S
N
(
2
)
_

.
In [KMV02, Section 3], the ALS estimator using the noise variance estimate
2
instead of
the true noise variance
2
is proven to be consistent.
5.8 Simulation Examples
In this section, we present numerical results with the estimators

F and
2
. The data is
simulated. The fundamental matrix

F is a randomly chosen rank two matrix with unit
Frobenius norm. The true coordinates u
(i)
and v
(i)
have third components equal to one, and
the rst two components are vectors with unit normand randomdirection. The perturbations
u
(i)
and v
(i)
are selected according to the assumptions stated in this book; i.e., the third
components u
(i)
3
= v
(i)
3
= 0, and u
(i)
j
, v
(i)
j
N(0,
2
), are independent for i = 1, . . . , N
and j = 1, 2. In each experiment, the estimation is repeated 1000 times with the same true
data and different noise realizations.
The true value of the parameter

F is known, which allows evaluation of the results.
We compare three estimators:

F
als
(
2
)the ALS estimator using the true noise variance
2
,

F
als
(
2
)the ALS estimator using the estimated noise variance
2
, and i
i
i
i
5.8. Simulation examples 83
150 200 250 300 350 400 450
0.05
0.1
0.15
0.2
0.25
0.3
0.35
PSfrag
Estimation of

F

X
ls

X
als
(
2
)

X
als
(
2
)
N
e
100 500
0.0
0.4
150 200 250 300 350 400 450
9.2
9.3
9.4
9.5
9.6
9.7
9.8
9.9
10.1
100 500
9.1
10
Estimation of
2
N

2
,

2

2

2
10
5
Figure 5.2. Left: relative error of estimation e := |

F

F|
F
/|

F|
F
as a function of the
sample size N. Right: convergence of the noise variance estimate
2
to the true value
2
.
150 200 250 300 350 400 450
0.005
0.01
0.015
0.02
0.025
N

m
i
n
(

F
a
l
s
,
1
)
100 500
0.00
0.03
150 200 250 300 350 400 450
5
6
7
8
9
100 500
4
10
N
|
1
N
S
N

F
N
|
F
10
3
Figure 5.3. Left: distance from

F
als,1
to the set of rank-decient matrices. Right: conver-
gence of
1
N
S
N
to F
N
.

F
ls
the LS estimator.
Figure 5.2 shows the relative error of estimation e := |

F

F|
F
/|

F|
F
as a function
of the sample size N in the left plot and the convergence of the estimate
2
in the right plot.
Figure 5.3, left plot, shows the convergence of the estimator

F
als,1
to the set of rank-decient
matrices. This empirically conrms inequality (5.24). The right plot in Figure 5.3 conrms
the convergence of
1
N
S
N
F
N
as N ; see (5.26).
Figure 5.4 shows the function S
N
(
2
), used in the estimation of
2
, for N = 500
in the left plot and for N = 30 in the right plot. In general, S
N
(
2
) is a nonconvex, non-
differentiable function with many local minima. However, we observed empirically that
the number of local minima decreases as N increases. For larger sample sizes and smaller
noise variance, the function S
N
(
2
) becomes unimodal. i
i
i
i
84 Chapter 5. Bilinear Errors-in-Variables Model
0.4 0.6 0.8 1 1.2 1.4 1.6
x 10
4
0
0.002
0.004
0.006
0.008
0.01
0.012
0.014
0.016
0.018
N = 500

2
S
N
(

2
)

2

2
S
N
0.4 0.6 0.8 1 1.2 1.4 1.6
x 10
4
0
0.5
1
1.5
x 10
4

2
S
N
(

2
)

2
2
S
N
N = 30
Figure 5.4. The function S
N
(
2
) used for the estimation of
2
. Left: large sample size.
Right: small sample size.
5.9 Conclusions
We considered the bilinear model AXB = C. The TLS estimator is inconsistent in this
case. We constructed the ALS estimator, which is consistent and computationally cheap.
Conditions for weak and strong consistency were listed.
An open question is, What are the optimality properties of the ALS estimator? For the
linear model AX = B, in [KM00] it was shown that the ALS estimator is asymptotically
efcient when V

A
is known exactly and E

b
2
kl
are known up to a constant factor. It would
be interesting to check the following conjecture:
In the model AXB = C, the ALS estimator is asymptotically efcient when
V

A
and V

B
are known exactly and E c
2
il
are known up to a constant factor.
Estimation of the rank-decient fundamental matrix, yielding information about the
relative orientation of two images in two-viewmotion analysis, was considered. Aconsistent
estimator was derived by minimizing a corrected contrast function in a bilinear EIV model.
The proposed ALS estimator was computed in three steps: rst, estimate the measurement
error variance; second, construct a preliminary matrix estimate; and third, project that
estimate onto the space of singular matrices. i
i
i
i
Chapter 6
Ellipsoid Fitting
A parameter estimation problem for ellipsoid tting in the presence of measurement errors
is considered. The LS estimator is inconsistent and, due to the nonlinearity of the model, the
orthogonal regression estimator is inconsistent as well; i.e., these estimators do not converge
to the true value of the parameters as the sample size tends to innity. Aconsistent estimator
is proposed, based on a proper correction of the LS estimator. The correction is explicitly
given in terms of the true value of the noise variance.
In Section 6.2, we dene the quadratic EIV model. The LS and ALS estimators are
dened in Sections 6.3 and 6.4. Ellipsoid estimates are derived from the general quadratic
model estimates in Section 6.5. An algorithm for ALS estimation is outlined in Section 6.6.
We show simulation examples in Section 6.7.
6.1 Introduction
In this chapter, we consider the ellipsoid tting problem: given a set of data points
x
(1)
, . . . , x
(N)
, where x
(i)
R
n
,
nd an ellipsoid
E (A
e
, c) := x R
n
: (x c)

A
e
(x c) = 1 , A
e
=

A

e
> 0, (6.1)
that best matches them. The freedom in the choice of the matching criterion gives rise to
different estimation methods.
One approach, called algebraic tting, is to solve the optimization problem
min
A
e
,c
N

i=1
_
(x
(i)
c)

A
e
(x
(i)
c) 1
_
2
(6.2)
and to dene the estimate as any global optimal point. We will refer to (6.2) as the LS
method for the ellipsoid model.
85 i
i
i
i
86 Chapter 6. Ellipsoid Fitting
Another approach, called geometric tting, is to solve the optimization problem
min
A
e
,c
N

i=1
_
dist
_
x
(i)
, E (A
e
, c)
_
_
2
, (6.3)
where dist(x, E ) is the Euclidean distance from the point x to the set E . In the statistical
literature, (6.3) is called the orthogonal regression method.
Note 6.1 (Orthogonal regression TLS) The TLS method applied to the ellipsoid model
(6.1) results in the following optimization problem:
min
A
e
,c
x
(1)
,...,x
(N)
N

i=1
_
_
x
(i)
_
_
2
subject to
_
x
(i)
+ x
(i)
c
_

A
e
_
x
(i)
+ x
(i)
c
_
= 1
for i = 1, . . . , N.
(6.4)
Clearly,
dist(x, E ) = arg
_
min
x
|x|
2
subject to (x + x c)

A
e
(x + x c) = 1
_
and (6.4) is separable in x
(1)
, . . . , x
(N)
, so that the TLS problem (6.4) is equivalent to
the orthogonal regression problem(6.3). In (6.3), the auxiliary variables x
(1)
, . . . , x
(N)
are hidden in the dist function.
We assume that all data points are noisy measurements x
(i)
:= x
(i)
+ x
(i)
of some true
points x
(1)
, . . . , x
(N)
that lie on a true ellipsoid E (

A
e
, c); i.e., the model is a quadratic EIV
model. The measurement errors x
(1)
, . . . , x
(N)
are centered, independent, and identically
distributed (i.i.d.), and the distribution is normal with variance-covariance matrix
2
I, where

2
is the noise variance.
Due to the quadratic nature of the ellipsoid model with respect to the measurement x,
both the algebraic and the geometric tting methods are inconsistent in a statistical sense,
see [NS48] and the discussion in [Ful87, page 250]. We propose an ALS estimator that is
consistent.
The LS estimator, dened by (6.2), is a nonlinear least squares problem. We use a
computationally cheap, but suboptimal method to solve the optimization problem(6.2). The
equation dening the ellipsoid model is embedded in the quadratic equation
x

Ax +b

x +d = 0, A = A

> 0, (6.5)
which is linear in the parameters A, b, and d, so that a linear least squares estimation
is possible. For given estimates

A,

b, and

d of the parameters in (6.5), assuming that

A =

A

> 0, the estimates of the original parameters in (6.2) are given by


c :=
1
2

A
1

b and

A
e
:=
1
c

A c

d

A. (6.6)
The necessarycomputationfor the (suboptimal) LSestimator involves ndinganeigenvector
associated with the smallest eigenvalue of a symmetric matrix. We use the same indirect
approach to compute the ALS estimator. i
i
i
i
6.2. Quadratic errors-in-variables model 87
The correction needed for the ALS estimator is given explicitly in terms of the noise
variance
2
. We give an algorithm for ellipsoid tting that implements the theoretical
results. Its computational cost increases linearly with the sample size N. In [KMV04], we
present the statistical properties of the estimator and treat the case when
2
is unknown.
The orthogonal regression estimator, on the other hand, is computed via a local op-
timization method and scales worse with N and with the dimension n of the vector space.
In addition, due to the nonconvexity of the cost function in (6.3), the computed solution
depends on the supplied initial approximation. In degenerate cases (see [Nie02, pages 260
261]) the global minimum of (6.3) is not unique, so that there are several best tting
ellipses.
We point out several papers in which the ellipsoid tting problem is considered.
Gander, Golub, and Strebel [GGS94] consider algebraic and geometric tting methods for
circles and ellipses and note the inadequacy of the algebraic t on some specic examples.
Later on, the given examples are used as benchmarks for the algebraic tting methods.
Fitting methods, specic for ellipsoids, as opposed to the more general conic sections,
are rst proposed in [FPF99]. The methods incorporate the ellipticity constraint into the
normalizing condition and thus give better results when an elliptic t is desired. In [Nie01]
a new algebraic tting method is proposed that does not have as singularity the special case
of a hyperplane tting; if the best tting manifold is afne the method coincides with the
TLS method.
Astatistical point of viewonthe ellipsoidttingproblemis takenin[Kan94]. Kanatani
proposed an unbiased estimation method, called a renormalization procedure. He uses an
adjustment similar to the one in the present chapter, but his approach of estimating the
unknown noise variance is different from the one presented in [KMV04]. Moreover, the
noise variance estimate proposed in [Kan94] is still inconsistent; the bias is removed up to
the rst order approximation.
6.2 Quadratic Errors-in-Variables Model
A second order surface in R
n
is the set
B(A, b, d) := x R
n
[ x

Ax +b

x +d = 0 , (6.7)
where the symmetric matrix A S, the vector b R
n
, and the scalar d R are parameters
of the surface. If A = 0 and b ,= 0, then the surface (6.7) is a hyperplane, and if A is
positive denite and 4d < b

A
1
b, then (6.7) is an ellipsoid. Until Section 6.5, we will
only assume that B(A, b, d) is a nonempty set, but in Section 6.5, we will come back to the
ellipsoid tting problem, so that the parameters will be restricted.
Let

A S,

b R
n
, and

d R be such that the set B(

A,

b,

d) is nonempty and let the
points x
(1)
, . . . , x
(N)
lie on the surface B(

A,

b,

d), i.e.,
x
(i)

A x
(i)
+

x
(i)
+

d = 0, for i = 1, . . . , N. (6.8)
The points x
(1)
, . . . , x
(N)
are measurements of the points x
(1)
, . . . , x
(N)
, respectively, i.e.,
x
(i)
= x
(i)
+ x
(i)
, for i = 1, . . . , N, (6.9) i
i
i
i
88 Chapter 6. Ellipsoid Fitting
where x
(1)
, . . . , x
(N)
are the corresponding measurement errors. We assume that the mea-
surement errors form an i.i.d. sequence and the distribution of x
(i)
, for all i = 1, . . . , N, is
normal and zero mean, with variance-covariance matrix
2
I
n
, i.e.,
E x
(i
1
)
x
(i
2
)
= 0, for all i
1
, i
2
= 1, . . . , N, i
1
,= i
2
,
and
x
(i)
N(0,
2
I
n
), for i = 1, . . . , N,
where
2
> 0 is called the noise variance.
The matrix

A is the true value of the parameter A, while

b and

d are the true values
of the parameters b and d, respectively. Without additional constraints imposed on the
parameters, for a given second order surface B(A, b, d), the model parameters A, b, and d
are not unique: B(A, b, d) is the same surface for any real nonzero . This makes the
quadratic model, parameterized by A, b, and d, nonidentiable. To resolve the problem,
we impose a normalizing condition; e.g., the true values of the parameters are assumed to
satisfy the constraint
|

A|
2
F
+|

b|
2
+

d
2
= 1. (6.10)
Then the estimates are unique up to a sign.
Note 6.2 (Invariance of the LS and ALS estimators) As shown in [Boo79, page 59],
[GGS94, page 564, equation (3.5)], and [Pra87, page 147], the constraint (6.10) is not
invariant under Euclidean transformations. As a result, the LS estimator is not invariant un-
der Euclidean transformations. Such a dependence on the coordinate system is undesirable.
Suggestions for making the LS estimator invariant can be found in [Nie01].
The following question arises. Are the ALS estimators derived with the constraint
(6.10) invariant? If the noise variance is xed, the answer is negative. However, if we are
allowed to modify the noise variance after the transformation of the data, then the ALS
estimator can be made invariant.
Amodication of the noise variance that ensures invariance under Euclidean transfor-
mations is the noise variance estimation procedure derived in [KMV04]. We demonstrate
the invariance properties of the ALS estimator with estimated noise variance by a simulation
example in Section 6.7. Rigorous analysis is presented in [SKMH05].
6.3 Ordinary Least Squares Estimation
The LS estimator for the second order surface model (6.7), subject to the normalizing con-
dition (6.10), is dened as a global minimum point of the following optimization problem:
min
A,b,d
N

i=1
_
x
(i)
Ax
(i)
+b

x
(i)
+d
_
2
subject to
_
A = A

,
|A|
2
F
+|b|
2
+d
2
= 1.
(6.11)
The LS cost function is
Q
ls
(A, b, d) =
N

i=1
q
ls
(A, b, d; x
(i)
), i
i
i
i
6.3. Ordinary least squares estimation 89
where the elementary LS cost function
q
ls
(A, b, d; x) = (x

Ax +b

x +d)
2
measures the discrepancy of a single measurement point x from the surface B(A, b, d).
In order to derive the solution of (6.11), we introduce a parameter vector containing
all decision variables. Let vec
s
: S R
(n+1)n/2
be an operator, a symmetric matrix
vectorizing operator, that stacks the upper triangular part of A in a vector. The vector of
decision variables is
:= col
_
vec
s
(A), b, d
_
, (6.12)
an element of the parameter space R
n

, n

:= (n + 1)n/2 +n + 1.
Dene the symmetric Kronecker product
s
by
x

Ax = (x
s
x)

vec
s
(A) for all x R
n
and A S. (6.13)
We have for the elementary LS cost function
q
ls
(; x) = (x

Ax +b

x +d)
2
=
_
_
(x
s
x)

. .
y

_
_
vec
s
(A)
b
d
_
_
_
2
= (y

)
2
=

yy

(6.14)
and for the LS cost function
Q
ls
() =
N

i=1
q
ls
(; x
(i)
) =
N

i=1
_
(y
(i)
)
_
2
= |Y |
2
=

Y ,
where
y
(i)
:=
_
_
x
(i)

s
x
(i)
x
(i)
1
_
_
, for i = 1, . . . , N, and Y :=
_

_
y
(1)
.
.
.
y
(N)
_

_.
Let H R
n

be a matrix, such that


|H|
2
= |A|
2
F
+|b|
2
+d
2
, for all A S, b R
n
, d R, (6.15)
where is dened in (6.12).
The LS estimation problem(6.11) is equivalent to the following classical quadratically
constrained least squares problem:
min

|Y |
2
subject to |H|
2
= 1. (6.16)
The LSestimator

ls
is H
1
v
min
, where v
min
is a normalizedeigenvector of H
T
Y

Y H
1
,
corresponding to the smallest eigenvalue. i
i
i
i
90 Chapter 6. Ellipsoid Fitting
In order to avoid the computation of the Grammatrix Y

Y , one can obtain the solution


from the SVD of Y H
1
. Let
Y H
1
= UV

, with U

U = I, V

V = I, and
= diag(
1
, . . . ,
n
),
1

n
0. (6.17)
Then

ls
is H
1
v
min
, where v
min
is the last column of the matrix V .
Note 6.3 The matrix H that ensures (6.15) is a diagonal matrix with diagonal elements equal
to 1 or

2, where the latter correspond to the off-diagonal elements of A; see Note 6.5.
Since the normalizing condition (6.10) is arbitrary, however, we can choose any nonsingular
matrix H in (6.16). Particularly simple is H = I. The LS and ALS estimators depend
on the normalizing condition, but the ALS estimator is consistent for any nondegenerate
normalizing condition, i.e., for any full-rank matrix H.
Note that vec
s
(xx

) ,= x
s
x. One can verify that x
s
x = D vec
s
(xx

), where D
is a diagonal matrix with diagonal elements equal to 1 or 2; the latter corresponds to the
off-diagonal elements of xx

appearing in the product D vec


s
(xx

); see Note 6.5.


6.4 Adjusted Least Squares Estimation
The LS estimator is readily computable but it is inconsistent. We propose an adjustment
procedure that denes a consistent estimator. The proposed approach is due to [KZ02] and
is related to the method of corrected score functions; see [CRS95, Section 6.5].
The ALS estimator

als
is dened as a global minimum point of the following opti-
mization problem:
min

Q
als
() subject to |H|
2
= 1,
where the ALS cost function Q
als
is
Q
als
() =
N

i=1
q
als
(; x
(i)
) for all R
n

.
Let x = x + x, where x is normally distributed with zero mean and variance-covariance
matrix
2
I. The elementary ALS cost function q
als
is dened by the identity
Eq
als
(, x + x) = q
ls
(, x), for all R
n

and x R
n
. (6.18)
We motivate the denition of the ALS cost function as follows:
Q
ls
() :=
N

i=1
q
ls
(; x
(i)
), for all R
n

,
has as a global minimum point the true value of the parameter vector

:= col
_
vec
s
(

A),

b,

d
_
. i
i
i
i
6.4. Adjusted least squares estimation 91
Indeed, Q
ls
0 and by denition Q
ls
(

) = 0. From
EQ
als
= Q
ls
,
we see that, as the sample size grows, Q
als
approximates Q
ls
. Provided that Q
ls
has

as a
unique global minimum(the contrast condition of [KMV04]), the ALS estimator is strongly
consistent.
Next, we derive an explicit expression for the ALS cost function Q
als
. From (6.18)
and (6.14), we have
Eq
als
(, x) = q
ls
(, x) =

y y

=:

ls
( x),
where
y := col
_
( x
s
x), x, 1
_
and
ls
( x) := y y

.
Thus the ALS elementary cost function q
als
is quadratic in ,
q
als
(; x) =

als
(x),
where
E
als
(x) =
ls
( x). (6.19)
Under the normality assumption for the noise term x, (6.19) yields the following convolution
equation:
_
1
2
2
_
n/2
_

als
( x + x)
n

i=1
exp
_

x
2
i
2
2
_
d x
1
d x
n
=
ls
( x).
Solving for the unknown
als
is a deconvolution problem.
The deconvolution problem can be solved independently for the entries of
als
. The
elements of the matrix
ls
( x) are monomials of at most fourth order in x. Consider the
generic term

ls
( x) = x
i
x
j
x
p
x
q
, where i, j, p, q 0, 1, . . . , n.
We formally set x
0
= 1 and allow any of the indices to be zero, in order to allow
ls
to be
of order less than four.
Let r(s), s = 1, . . . , n, denote the number of repetitions of the index s in the monomial
x
i
x
j
x
p
x
q
. For example, let n = 2. In the monomial x
1
x
3
2
, r(1) = 1 and r(2) = 3, and in
the monomial x
4
1
, r(1) = 4 and r(2) = 0.
The functions
t
0
() := 1, t
1
() := , t
2
() :=
2

2
,
t
3
() :=
3
3
2
, and t
4
() :=
4
6
2

2
+ 3
4
(6.20)
have the property
Et
k
( x
s
+ x
s
) = x
k
s
, for all x
s
R and k = 0, 1, 2, 3, 4, i
i
i
i
92 Chapter 6. Ellipsoid Fitting
where x
s
N(0,
2
). Thus the polynomial

als
(x) :=
n

s=1
t
r(s)
(x
s
) (6.21)
has the property
E
als
(x) = x
i
x
j
x
p
x
q
=
ls
( x) for all x R
n
.
This shows that
als
is the desired solution. The matrix
als
is constructed element-wise in
the described way.
The ALS cost function Q
als
is quadratic in ,
Q
als
() =

als
, for all R
n

,
where

als
=
N

i=1

als
(x
(i)
).
Thus the function Q
als
is described thoroughly.
Example 6.4 ( The matrix
als
for n = 2 ) The model parameters are A = [
a
11
a
12
a
21
a
22
], b =
_
b
1
b
2

, and the scalar d. The parameter space is 6-dimensional with


:= col
_
vec
s
(A), b, d
_
=
_
a
11
a
12
a
22
b
1
b
2
d

.
From (6.13), we have
_
x
1
x
2

s
_
x
1
x
2

=
_
x
1
x
1
2x
1
x
2
x
2
x
2

,
so that
y := col
_
(x
s
x), x, 1
_
=
_
x
1
x
1
2x
1
x
2
x
2
x
2
x
1
x
2
1

ls
(x) = yy

=
_

_
x
4
1
2x
3
1
x
2
x
2
1
x
2
2
x
3
1
x
2
1
x
2
x
2
1
4x
2
1
x
2
2
2x
1
x
3
2
2x
2
1
x
2
2x
1
x
2
2
2x
1
x
2
x
4
2
x
1
x
2
2
x
3
2
x
2
2
x
2
1
x
1
x
2
x
1
x
2
2
x
2
1
_

_
,
with s indicating the symmetric elements.
The adjusted matrix
als
is
als
=
ls
+
als
, where the correction
als
is
_

_
3
4
6
2
x
2
1
6
2
x
1
x
2

als,13
3
2
x
1

2
x
2

2

als,22
6
2
x
1
x
2
2
2
x
2
2
2
x
1
0
3
4
6
2
x
2
2

2
x
1
3
2
x
2

2

2
0 0

2
0
0
_

_
, i
i
i
i
6.5. Ellipsoid estimation 93

als,13
=
4

2
(x
2
1
+x
2
2
), and
als,22
= 4
4
4
2
(x
2
1
+x
2
2
).
The correction matrix
als
, without the fourth order terms in , is derived in [Zha97,
Section 7]. The derivation in [Zha97], however, applies only for the two-dimensional case.
The recommended way of computing the LS estimator is via the SVD of Y H
1
. For
the ALS estimator we use the less accurate eigenvalue decomposition because the correction
is derived for
ls
= Y

Y and cannot be determined for the factor Y .


6.5 Ellipsoid Estimation
The ALS estimator
als
is derived for the general quadratic EIV model (6.8)(6.9). Now we
specialize it for the ellipsoid tting problem; i.e., we assume that the true surface belongs
to the class of surfaces
B(A
e
, c) = x R
n
: (x c)

A
e
(x c) = 1 (6.22)
for some true values

A
e
S,

A
e
=

A

e
> 0, and c of the parameters A
e
and c. The equation
dening B(

A
e
, c) can be written as
x


A
e
x 2(

A
e
c)

x + c


A
e
c 1 = 0,
or, with :=
_
|

A
e
|
2
F
+|2

A
e
c|
2
+ ( c


A
e
c 1)
2
_
1/2
,
x

(

A
e
/)x 2(

A
e
c/)

x + ( c


A
e
c 1)/ = 0.
Introduce the new parameters

A :=

A
e

,

b := 2

A
e
c

, and

d :=
c


A
e
c 1

.
As dened,

A,

b, and

d satisfy the normalizing condition (6.10).
We can go back to the original parameters

A
e
and c from

A,

b, and

d that satisfy (6.10)
by
c =
1
2

A
1

b and

A
e
=
1
c

A c

d

A. (6.23)
Note that = c


A c

d is nonzero. Let

A,

b,

d be the ALS estimator of the parameters

A,

b,

d. The estimator of the parameters

A
e
and c is given by the transformation (6.6).
If the obtained estimate

A
e
is indenite, we impose a posteriori positive deniteness
by the projection

A
e,2
:=

l:

l
>0

l
v
l
v

l
, (6.24)
where

A
e
=

n
l=1

l
v
l
v

l
is the eigenvalue decomposition of

A
e
. Indenite estimate

A
e
can be obtained because the estimator does not enforce the prior knowledge

A
e
=

A

e
> 0.
Clearly, the two-stage procedure

A
e
obtained on the rst stage and

A
e,2
on the second
stageis suboptimal. Empirical results, however, suggest that the event of having the
constraint

A
e
> 0 active is rare. Typically, it occurs for a small sample size with nonuniform
data point distribution and for data with outliers. Due to

A
e
=

A

e
> 0 and the consistency
of the estimator

A
e
, we expect that for large sample size,

A
e
> 0. i
i
i
i
94 Chapter 6. Ellipsoid Fitting
6.6 Algorithm for Adjusted Least Squares
Estimation

In this section, we summarize the estimation procedure described above by giving Algo-
rithm 6.1 for its computation. Notation similar to the MATLAB syntax for indexing the
elements of a matrix is used. For example, A(i
1
:i
2
, j
1
:j
2
) stands for the submatrix of A
obtained by selecting the elements with rst index in the set i
1
, i
1
+ 1, . . . , i
2
and with
second index in the set j
1
, j
1
+ 1, . . . , j
2
.
Note 6.5 If a general quadratic model is estimated, the normalizing condition is given
as prior knowledge, see Note 6.3. If an ellipsoid is estimated, however, the normalizing
condition is arbitrary. In Algorithm 6.1, we set H = I, which corresponds to a normalizing
condition
| vec
s
(A)|
2
+|b|
2
+d
2
= 1.
The matrix H corresponding to the normalizing condition (6.10) is
H =
_
_

D
I
n
1
_
_
,
where D is a diagonal matrix with diagonal elements
D
ii
=
_
2 if i I,
1 otherwise.
Note 6.6 (Known blocks of the matrix
als
) Algorithm 6.1 can be improved by setting
certain elements of
als
in advance and not by following the general adjustment procedure.
Consider a block partitioning of the matrices
ls
,
als
, and
als
according to the partitioning
of the vector
_
(x
s
x)

[ x

[ 1

;
e.g., for
ls
, denote

ls
=:
_
_

ls,11

ls,12

ls,13

ls,22

ls,23

ls,33
_
_
.
All elements of
ls
are monomials in x; moreover all elements of:

ls,11
(x) are of fourth order,

ls,12
(x) are of third order,

ls,13
(x) and
ls,22
(x) are of second order,

ls,23
(x) are of rst order, and
the scalar
ls,33
(x) = 1 is independent of x. i
i
i
i
6.6. Algorithm for adjusted least squares estimation

95
Algorithm 6.1 ALS ellipsoid tting als_fit
Input: a matrix X :=
_
x
(1)
x
(N)

R
nN
and the noise variance
2
.
1: Formthe tensor T R
5nN
, T(k, l, i) := t
k
_
X(l, i)
_
, for k = 0, . . . , 4, l = 1, . . . , n,
and i = 1, . . . , N, where the functions t
k
, k = 0, 1, 2, 3, 4, are given in (6.20).
2: Dene the vectors 1, l R
n+1
by 1 := col(1, . . . , 1, 1), l := col(1, . . . , n, 0), and form
the matrix M R
n

2
, n

:= (n +1)n/2 +n +1, M :=
_
vec
s
( l 1

) vec
s
( 1 l

.
We use M to nd the indices of x in the entries of
ls
( x). Note that
_
M(p, 1), M(p, 2)
_
are the indices of x in the pth entry of y := x
s
x. Recall that
ls
( x) := y y

. Thus the
indices of x in the (p, q)th entry of
ls
( x) are
_
M(p, 1), M(p, 2), M(q, 1), M(q, 2)
_
.
3: Dene a binary operator == by (l
1
==l
2
) := 1 if l
1
= l
2
and 0, otherwise, for all
l
1
, l
2
R. Form the tensor R R
n

n
,
R(p, q, l) =
_
M(p, 1)==l
_
+
_
M(p, 2)==l
_
+
_
M(q, 1)==l
_
+
_
M(q, 2)==l
_
,
for all q p and l = 1, . . . , n, which contains the number of repetitions of an index l in
an entry (p, q)th of
ls
( x). In terms of the function r, dened in Section 6.4, R stores
_
r(1), . . . , r(n)
_
for the entries of
ls
( x).
4: Compute

als
(p, q) =
N

i=1
n

l=1
T
_
R(p, q, l), l, i
_
for all q p.
This step corresponds to the correction (6.21) from Section 6.4.
5: Form the set I of the indices of the vector vec
s
(A), corresponding to the off-diagonal
elements of A, I = 1, . . . , (n + 1)n/2 l(l + 1)/2 : l = 1, . . . , n. (I
1
I
2
denotes the set difference of the sets I
1
and I
2
.) Note that l(l +1)/2 [ l = 1, . . . , n
are the indices of vec
s
(A), corresponding to the diagonal elements of A.
6: Form the symmetric matrix
als
by

als
(p, q) :=
_

_
4
als
(p, q) if p I and q I,
1
als
(p, q) if p , I and q , I,
2
als
(p, q) otherwise,
for all q p, and
als
(p, q) :=
als
(q, p), for all q < p.
7: Find an eigenvector

als
associated with the smallest eigenvalue of
als
.
8: Normalize

als
,

als
:=

als
/|

als
|.
9: Reconstruct the estimates

A,

b, and

d from the vector

als
,

A := vec
s
1
_

als
(1 : n(n + 1)/2)
_
,

b :=

als
_
n(n + 1)/2 + 1 : n

1
_
,

d :=

als
(n

),
where vec
s
1
: R
n(n+1)/2
S, forms a symmetric matrix out of the vector of the
elements in its upper triangular part.
10: Obtain the estimates of the ellipsoid parameters

A
e
and c by (6.6).
11: If

A
e
0, project

A on the positive denite cone by (6.24).
Output: the estimates

A
e
, c of the ellipsoid parameters. i
i
i
i
96 Chapter 6. Ellipsoid Fitting
For the blocks of order zero and one, there is no correction applied in the formation of the
matrix
als
. The correction for the elements of the blocks of order two is
2
I
n
. Thus for
the corresponding blocks of
als
, we have

als,22
(x) = xx


2
I
n
,
als,23
(x) = x,

als,13
(x) = x
s
x vec
s
(
2
I
n
),
als,33
(x) = 1.
Finally, the corresponding blocks of
als
are

als,22
=

N
i=1
x
(i)
x
(i)
N
2
I
n
,
als,23
=

N
i=1
x
(i)
,

als,13
=

N
i=1
x
(i)

s
x
(i)
vec
s
(N
2
I
n
),
als,33
= N,
and only the upper triangular part of the block
als,11
and the block
als,12
need to be
computed in steps 4 and 5 of Algorithm 6.1.
6.7 Simulation Examples
We show the ALS, LS, and orthogonal regression (OR) estimates for a test example from
[GGS94], called special data. It is designed to illustrate the inadequacy of the algebraic
tting method and to show the advantage of the OR method.
Only data points are given; even if they are generated with a true model, we do not
know it. For this reason the comparison is visual. Since the noise variance needed for the
ALS estimator is unknown, we estimate it via the procedure proposed in [KMV04].
Figure 6.1 shows the data points with the estimated ellipses superimposed on them.
The OR estimator is computed by a general purpose optimization algorithm (MATLAB
function fmincon). The cost function is evaluated as explained in [Zha97, Section 5.2].
For the rst test example (see Figure 6.1, left) the OR estimator is inuenced by
the initial approximation. Using the LS estimate as initial approximation, the optimization
algorithmconverges to a local minimum. The resulting estimate is the dashed-dotted ellipse
closer to the LS estimate. Using the ALS estimate as initial approximation, the obtained
estimate is the dashed-dotted ellipse closer to the ALS estimate. Next, we will consider the
better of the two OR estimates.
Although the sample size is only N = 8 data points, the ALS estimator gives good
estimates that are comparable with the OR estimate. The value of the OR cost function
(see (6.3)) is 3.2531 for the LS estimator, 1.6284 for the ALS estimator, and 1.3733 for
the OR estimate. The ALS estimator is less than 19% suboptimal. Moreover, the volume
of the OR estimate is 62.09 square units, while the volume of the ALS estimate is 34.37
square units, which is nearly twice as small. Visually (as well as in other senses), smaller
estimates are preferable.
In a second example, taken from[Sp97], the ALSestimate is close to the ORestimate;
see Figure 6.1, right. In terms of the OR cost function, the ALS estimate is less than 25%
suboptimal. The volume of the ALS estimate is comparable to that of the OR estimate.
Figure 6.2 illustrates the invariance properties of the ALS estimator with estimated
noise variance. The data used is again the special data from [GGS94]. The gure shows
translated, rotated, scaled, and translated and rotated data points with the corresponding
ALS estimates. i
i
i
i
6.7. Simulation examples 97
0 5 10 15
2
0
2
4
6
8
Test example special data from [GGS94].
x
1
x
2
2 4 6 8
2
0
2
4
6
8
10
12
Example from [Sp97].
x
1
Figure 6.1. Test examples. dashedLS, dashed-dottedOR, solidALS, data points,
centers of the estimated ellipses.
15 10 5 0 5 10 15 20 25 30 35
0
5
10
15
20
25
30
x
1
x
2
original
translated
scaled
rotated
translated
and rotated
(0, 0)
Figure 6.2. ALS estimates of the original, translated, rotated, scaled, and translated and
rotated data points. data points, centers of the estimated ellipses, point (0, 0). i
i
i
i
98 Chapter 6. Ellipsoid Fitting
6.8 Conclusions
The LS estimation of the ellipsoid parameters from noisy measurements of points on its
boundary is a nonlinear least squares problem. An indirect, suboptimal approach was
used that transforms the ellipsoid model to a general quadratic model and applies linear
least squares estimation. Due to the measurement errors, however, the LS estimator is
inconsistent.
Assuming that the measurement errors are normally distributed, a correction is derived
that uses the true measurement error variance and adjusts the LS cost function, so that
the resulting ALS estimator is consistent. An algorithm for the necessary computation is
outlined.
The ALSestimator is illustrated via simulation examples. Compared to the orthogonal
regression estimator, it has the advantage of being cheaper to compute and independent
of initial approximation. The computational efciency is crucial for higher dimensional
ellipsoid tting and for problems with large sample size. i
i
i
i
Part II
Dynamic Problems
99 i
i
i
i i
i
i
i
Chapter 7
Introduction to
Dynamical Models
With this chapter, we start to consider modeling problems for linear time-invariant (LTI)
systems. First, we give an introduction to the LTI model class using the behavioral language.
As in the static case, a key question is the representation of the model, i.e., howit is described
by equations. Again, the kernel, image, and input/output representations play an important
role, but other representations that bring additional structure into evidence are used as well.
Dynamical systems are much richer in properties than static systems. In the dynamic
case, the memory of the system is central, i.e., the fact that the past can affect the future.
The intuitive notion of memory is formalized in the denition of state. In addition, a key
role is played by the controllability property of the system. Every linear static systemhas an
image representation. In the dynamic case this is no longer true. A necessary and sufcient
condition for existence of an image representation is controllability.
7.1 Linear Time-Invariant Systems
Dynamical systems describe variables that are functions of one independent variable, re-
ferred to as time. In Chapter 2, a system was dened as a subset Bof a universum set U .
In the context of dynamical systems, U is a set of functions w : T W, denoted by W
T
.
The sets W and T R are called, respectively, signal space and time axis. The signal
space is the set where the system variables take on their values and the time axis is the set
where the time variable takes on its values. We use the following denition of a dynamical
system [Wil86a].
A dynamical system is a 3-tuple = (T, W, B), with T R the time axis, W
the signal space, and B W
T
the behavior.
The behavior B W
T
is the set of all legitimate functions, according to the system ,
from the universum set U = W
T
. When the time axis and the signal space are understood
from the context, as is often the case, we may identify the system = (T, W, B) with
its behavior B. As with any set, the behavior can be described in a number of ways.
In the context of dynamical systems, most often used are representations by equations
101 i
i
i
i
102 Chapter 7. Introduction to Dynamical Models
f : W
T
R
g
, i.e., B = w W
T
[ f(w) = 0 . The equations f(w) = 0 are called
annihilating behavioral equations.
Of interest are systems with special properties. In the behavioral setting,
a property of the system is always dened in terms of the behavior and then
translated to equivalent statements in terms of particular representations.
Similarly, the statement that w is a trajectory of , i.e., w B, is translated to more
convenient characterizations for numerical verication in terms of representations of .
Note 7.1 (Classical vs. behavioral theory) In the classical theory, system properties are
often dened on the representation level; i.e., a property of the system is dened as a
property of a particular representation. (Think, for example, of controllability, which is
dened as a property of a state space representation.) This has the drawback that such a
denition might be representation dependent and therefore not a genuine property of the
systemitself. (For example, a controllable system(see Section 7.5) for denition, may have
uncontrollable state representation.)
It is more natural to work instead the other way around.
1 Dene the property in terms of the behavior B.
2 Find the implications of that property on the parameters of the system in particular
representations. On this level, algorithms for verication of the property are derived.
The way of developing system theory as a sequence of steps 1 and 2 is characteristic for the
behavioral approach.
A static system (U, B) is linear when the universum set U is a vector space and the behavior B is a linear subspace. Analogously, a dynamical system Σ = (T, W, B) is linear when the signal space W is a vector space and B is a linear subspace of W^T (viewed as a vector space in the natural way).

The universum set W^T of a dynamical system has special structure that is not present in the static case. For this reason dynamical systems are richer in properties than static systems. Next, we restrict ourselves to the case when the time axis is either T = N or T = Z and define two properties: time-invariance and completeness. In keeping with tradition, we call a function w ∈ W^T a time series.

A system Σ = (N, W, B) is time-invariant if σB ⊆ B, where σ is the backward shift operator, (σw)(t) := w(t + 1), and σB := { σw | w ∈ B }. In the case T = Z, a system Σ = (Z, W, B) is time-invariant if σB = B. Time-invariance requires that if a time series w is a trajectory of a time-invariant system, then all its backward shifts σ^t w, t > 0, are also trajectories of that system.
The restriction of the behavior B ⊆ (R^w)^T to the time interval [t_1, t_2], where t_1, t_2 ∈ T and t_1 < t_2, is denoted by

B|_[t_1,t_2] := { w ∈ (R^w)^(t_2−t_1+1) | there are w_− and w_+ such that col(w_−, w, w_+) ∈ B }.

A system Σ = (T, W, B) is complete if

w|_[t_0,t_1] ∈ B|_[t_0,t_1] for all t_0, t_1 ∈ T, t_0 ≤ t_1  ⟹  w ∈ B;

i.e., by looking at the time series w through a window of finite width t_1 − t_0, one can decide whether or not it is in the behavior. Moreover, if the window can be taken to have a fixed width t_1 − t_0 = l, then the system is called l-complete. It turns out that a system is complete if and only if its behavior is closed in the topology of pointwise convergence; i.e., if w_i ∈ B for i ∈ N and w_i(t) → w(t) for all t ∈ T, then w ∈ B. Also, a system is l-complete if and only if there is a difference equation representation of that system with l time shifts. For LTI systems, the completeness property is also called finite dimensionality.

We consider the class of discrete-time complete LTI systems. Our generic notation for the signal space is W = R^w.

The class of all complete LTI systems with w variables is denoted by L^w.

Next, we discuss representations of the class L^w.
7.2 Kernel Representation
Consider the difference equation

R_0 w(t) + R_1 w(t+1) + ··· + R_l w(t+l) = 0, where R_τ ∈ R^(g×w).  (DE)

It shows the dependence among consecutive samples of the time series w. Assuming that R_l ≠ 0, the maximum number of shifts is l. The integer l is called the lag of the equation. Since in general (DE) is a vector equation, l is the largest lag among the lags l_1, ..., l_g of all scalar equations.

Obviously, (DE) induces a dynamical system via the representation

B = { w ∈ (R^w)^Z | (DE) holds }.

One can analyze B using the difference equation. It turns out, however, that it is more convenient to use polynomial matrix algebra for this purpose. (DE) is compactly written in terms of the polynomial matrix

R(z) := R_0 + R_1 z^1 + R_2 z^2 + ··· + R_l z^l ∈ R^(g×w)[z]

as R(σ)w = 0. Consequently, operations on the system of difference equations are represented by operations on the polynomial matrix R. The system induced by (DE) is

ker(R(σ)) := { w ∈ (R^w)^N | R(σ)w = 0 }.  (KER repr)

We call (KER repr) a kernel representation of the system B := ker(R(σ)).
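As an elementary numerical illustration (not part of the book's algorithms), the following MATLAB sketch checks that a given time series satisfies a scalar difference equation of the form (DE), i.e., that it belongs to the behavior defined by the kernel representation with coefficients R_0, ..., R_l. The system and trajectory are hypothetical and chosen only for illustration.

    % Sketch: verify that a time series satisfies R0*w(t) + R1*w(t+1) = 0,
    % i.e., that w is a trajectory of ker(R(sigma)) with R(z) = R0 + R1*z.
    % Hypothetical scalar example: w(t+1) = 2*w(t).
    R0 = -2; R1 = 1;                 % R(z) = -2 + z  (g = 1, w = 1, lag l = 1)
    T  = 10;
    w  = 2.^(0:T-1);                 % w(t) = 2^(t-1) satisfies the equation
    res = zeros(1, T-1);
    for t = 1:T-1
        res(t) = R0*w(t) + R1*w(t+1);    % residual of (DE) at time t
    end
    disp(max(abs(res)))              % zero, so w is in the behavior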
The following theorem summarizes the representation-free characterization of the class of complete LTI systems, explained in the previous section, and states that without loss of generality one can assume the existence of a kernel representation B = ker(R(σ)) of a system B ∈ L^w.

Theorem 7.2 (Willems [Wil86a]). The following are equivalent:

(i) Σ = (Z, R^w, B) is linear, time-invariant, and complete.
(ii) B is linear, shift-invariant, and closed in the topology of pointwise convergence.
(iii) There is a polynomial matrix R ∈ R^(g×w)[z], such that B = ker(R(σ)).

The linearity of the system induced by (DE) follows from the linearity of (DE) with respect to w. The shift-invariance follows from the time-invariance of the coefficients R_0, ..., R_l, and the completeness follows from the fact that (DE) involves a finite number l of shifts of the time series. Thus (iii) ⟹ (i) is immediate. The reverse implication (i) ⟹ (iii), on the other hand, requires proof; see [Wil86a, Theorem 5].
A kernel representation associated with a given B ∈ L^w is not unique. The nonuniqueness is due to:

1. linearly dependent equations (which refers to R not being full row rank) and
2. equivalence of the representations R(σ)w = 0 and U(σ)R(σ)w = 0, where U ∈ R^(g×g)[z] is a unimodular matrix.

A square polynomial matrix U is unimodular if it has a polynomial inverse. A necessary and sufficient condition for U to be unimodular is that its determinant is a nonzero constant. Two kernel representations of the same behavior are called equivalent.

Premultiplication of R with a unimodular matrix is a convenient way to represent a sequence of equivalence transformations on the system of difference equations (DE).

For a given system B ∈ L^w, there always exists a kernel representation in which the polynomial matrix R has full row rank [Wil91, Proposition III.3]. Such a kernel representation is called a minimal kernel representation. In a minimal kernel representation, the number of equations p := rowdim(R) is minimal among all possible kernel representations of B. All minimal kernel representations of a given system are in fact unimodularly equivalent; i.e., if R′(σ)w = 0 and R″(σ)w = 0 are both minimal, then there is a unimodular matrix U, such that R′ = UR″.
There exists a minimal kernel representation B = ker(R(σ)) in which the number of equations p = rowdim(R), the maximum lag l := max_{i=1,...,p} l_i, and the total lag n := Σ_{i=1}^p l_i are simultaneously all minimal over all possible kernel representations [Wil86a, Theorem 6]. Such a kernel representation is called a shortest lag representation. A kernel representation B = ker(R(σ)) is a shortest lag representation if and only if R(z) is row proper. The polynomial matrix R with rows r_1, ..., r_p, deg(r_i) =: l_i, is row proper if the leading row coefficient matrix (i.e., the matrix whose (i, j)th entry is the coefficient of the term with power l_i of R_ij(z)) is of full row rank. It can be shown that the l_i's are the observability indices of the system.

The minimal and shortest lag kernel representations correspond to special properties of the R matrix: in a minimal representation, R is full row rank, and in a shortest lag representation, R is row proper.
A shortest lag representation is a special minimal representation, because a row proper matrix is necessarily of full row rank. A shortest lag representation, however, is still not unique.

The minimal number of equations p, the lag l, and the total lag n are invariants of B. It turns out that p is equal to the number of outputs, called the output cardinality, in an input/output representation. Correspondingly, the integer m := w − p is also an invariant of B and is called the input cardinality. It is equal to the number of inputs in an input/output representation. The total lag n is equal to the state dimension in a minimal state space representation of B. We use the following notation:

m(B) for the input cardinality of B,
p(B) for the output cardinality of B,
n(B) for the minimal state dimension of B, and
l(B) for the lag of B.
7.3 Inputs, Outputs, and Input/Output Representation
Consider a projection operator Π ∈ R^(w×w) and a partitioning of the time series w ∈ (R^w)^Z into time series u and y as follows:

col(u, y) := Πw, where dim(u(t)) =: m, dim(y(t)) =: p, with m + p = w.

Correspondingly, a behavior B ∈ L^w is partitioned into two subbehaviors B_u and B_y. The variables in u are called free if B_u = (R^m)^Z. If, in addition, any other partitioning results in no more free variables, then B_u is called maximally free in B. A partitioning in which B_u is maximally free is called an input/output partitioning, with u an input and y an output.

There always exists an input/output partitioning of the variables of B ∈ L^w, in fact a componentwise one; see [Wil86a, Theorem 2]. It is not unique, but the number of free variables m and the number of dependent variables p are equal to, respectively, the input cardinality and the output cardinality of B and are invariant. In a minimal kernel representation ker(R(σ)) = B, the choice of such a partitioning amounts to the selection of a full-rank square submatrix of R. The variables corresponding to the columns of R that form the full-rank submatrix are dependent variables, and the other variables are free.

The inputs together with the initial conditions determine the outputs. This property is called processing [Wil91, Definition VIII.2]. Also, the inputs can be chosen so that they are not anticipated by the outputs. Nonanticipation is also called causality [Wil91, Definition VIII.4].
Let ker(R(σ)) be a minimal kernel representation of B ∈ L^w. One can always find a permutation matrix Π ∈ R^(w×w), such that the matrix P ∈ R^(p×p)[z], defined by the partitioning RΠ =: [Q  −P], has a nonzero determinant and the rational matrix

G(z) := P^(−1)(z)Q(z) ∈ R^(p×m)(z)  (TF)

is proper. This requires selecting a submatrix P, among all full-rank square submatrices of R, whose determinant has maximal degree. Then the corresponding partitioning of w, col(u, y) := Πw, is an input/output partitioning. G being proper implies that u is not anticipated by y; see [Wil91, Theorem VIII.7].

The difference equation

P(σ)y = Q(σ)u  (I/O eqn)

with an input/output partitioning is called an input/output equation, and the matrix G, defined in (TF), is called the transfer function of the system B := ker(R(σ)).

The class of LTI complete systems with w variables and at most m inputs is denoted by L^w_m.

The system B ∈ L^w induced by an input/output equation with parameters (P, Q) (and input/output partitioning defined by Π) is

B_i/o(P, Q, Π) := { w ∈ (R^w)^N | col(u, y) := Πw satisfies P(σ)y = Q(σ)u }.  (I/O repr)

(I/O repr) is called an input/output representation of the system B := B_i/o(P, Q, Π). If Π is the identity matrix I_w, it is skipped in the notation of the input/output representation.
7.4 Latent Variables, State Variables, and State Space
Representations
Modeling from first principles invariably requires the addition to the model of other variables apart from the ones that the model aims to describe. Such variables are called latent, and we denote them by l (not to be confused with the lag of a difference equation). The variables w that the model aims to describe are called manifest variables, in order to distinguish them from the latent variables.

An important result, called the elimination theorem [Wil86a, Theorem 1], states that the behavior

B(R, M) := { w ∈ (R^w)^N | there is l ∈ (R^l)^N, such that R(σ)w = M(σ)l }  (LV repr)

induced by the latent variable equation

R(σ)w = M(σ)l  (LV eqn)

is LTI. The behavior B(R, M) is called the manifest behavior of the latent variable system. The behavior of the manifest and latent variables together is called the full behavior of the system. The elimination theorem states that if the full behavior is LTI, then the manifest behavior is LTI; i.e., by eliminating the latent variables, the resulting system is still LTI.

A latent variable system is observable if there is a map w ↦ l, i.e., if the latent variables can be inferred from the knowledge of the system and the manifest variables. The kernel representation is the special case of the latent variable representation obtained for M = 0.
State variables are special latent variables that specify the memory of the system. More precisely, latent variables x are called state variables if they satisfy the following axiom of state [Wil91, Definition VII.1]:

(w_1, x_1), (w_2, x_2) ∈ B, t ∈ N, and x_1(t) = x_2(t)  ⟹  (w, x) ∈ B,

where

(w(τ), x(τ)) := (w_1(τ), x_1(τ)) for τ < t  and  (w(τ), x(τ)) := (w_2(τ), x_2(τ)) for τ ≥ t.

A latent variable representation of the system is a state variable representation if there exists an equivalent representation whose behavioral equations are first order in the latent variables and zeroth order in the manifest variables. For example, the equation

σx = A′x + B′v,  w = C′x + D′v

defines a state representation. It is called a state representation with a driving input because v acts like an input: v is free and, together with the initial conditions, determines a trajectory w ∈ B. The system induced by the parameters (A′, B′, C′, D′) is

B_ss(A′, B′, C′, D′) := { w ∈ (R^w)^N | there are v ∈ (R^v)^N and x ∈ (R^n)^N, such that σx = A′x + B′v, w = C′x + D′v }.
Any LTI system B ∈ L^w admits a representation by an input/state/output equation

σx = Ax + Bu,  y = Cx + Du,  w = col(u, y),  (I/S/O eqn)

in which both the input/output and the state structure of the system are explicitly displayed [Wil86a, Theorem 3]. The system B induced by an input/state/output equation with parameters (A, B, C, D) and Π is

B_i/s/o(A, B, C, D, Π) := { w ∈ (R^w)^N | col(u, y) := Πw and there is x ∈ (R^n)^N, such that σx = Ax + Bu, y = Cx + Du }.  (I/S/O repr)

(I/S/O repr) is called an input/state/output representation of the system B := B_i/s/o(A, B, C, D, Π). Again, Π is skipped whenever it is I.

An input/state/output representation is not unique. The minimal state dimension n = dim(x) among all input/state/output representations of B, however, is invariant (denoted by n(B)).

We denote the class of LTI systems with w variables, at most m inputs, and minimal state dimension at most n by L^{w,n}_m.
7.5 Autonomous and Controllable Systems
A system B is autonomous if for any trajectory w ∈ B the past

w_− := (..., w(−2), w(−1))

of w completely determines its future

w_+ := (w(0), w(1), ...).

A system B is autonomous if and only if its input cardinality m(B) equals 0. Therefore, an autonomous LTI system is parameterized by the pair of matrices A and C via the state space representation

σx = Ax,  y = Cx,  w = y.  (AUT)

The system induced by the state space representation with parameters (A, C) is

B_i/s/o(A, C) := { w ∈ (R^p)^N | there is x ∈ (R^n)^N, such that σx = Ax, w = Cx }.

The behavior of an autonomous system is finite dimensional; in fact, dim(B) = n(B). Alternatively, an autonomous LTI system is parameterized in a minimal kernel representation B = ker(R(σ)) by a square and nonsingular matrix R, i.e., R ∈ R^(p×p)[z], det(R) ≠ 0.
The system B is controllable if for any two trajectories w_1, w_2 ∈ B there is a third trajectory w ∈ B, such that w_1(t) = w(t), for all t < 0, and w_2(t) = w(t), for all t ≥ 0. The subset of controllable systems contained in the set L^w is denoted by L^w_ctrb. A noncontrollable system B can be represented [Wil91, Proposition V.8] as B = B_ctrb ⊕ B_aut, where B_ctrb is the largest controllable subsystem in B and B_aut is a (nonunique) autonomous subsystem.

A test for controllability of the system B in terms of the parameter R ∈ R^(g×w)[z] in a kernel representation B = ker(R(σ)) is given in [Wil91, Theorem V.2]: B is controllable if and only if the matrix R(z) has constant rank for all z ∈ C. Equivalently, B is controllable if and only if a matrix R that defines a minimal kernel representation of B is left prime. In terms of the input/output representation B = B_i/o(P, Q), B being controllable is equivalent to P and Q being left coprime.

The controllable subsystem B_ctrb of B can be found via the factorization R = FR′, where F ∈ R^(g×g)[z] and R′ is left prime: B_ctrb = ker(R′(σ)). In general, left multiplication of R with a nonsingular polynomial matrix changes the behavior: it amounts to adding an autonomous subbehavior. Only left multiplication with a unimodular matrix does not alter the behavior, because it adds the trivial autonomous behavior {0}.
7.6 Representations for Controllable Systems
The transfer function G parameterizes the controllable subsystem of B_i/o(P, Q). Let Z be the Z-transform,

Z(w) = w(0) + w(1)z^(−1) + w(2)z^(−2) + ···,

and consider the input/output equation

Z(y) = G(z)Z(u).  (TF eqn)

(TF eqn) is known as a frequency domain equation because G(e^(jω)) describes how the sinusoidal input u(t) = sin(ωt) is processed by the system:

y(t) = |G(e^(jω))| sin( ωt + ∠G(e^(jω)) ).

The system induced by G (with an input/output partition defined by Π) is

B_i/o(G, Π) := { w ∈ (R^w)^N | col(u, y) := Πw satisfies Z(y) = G(z)Z(u) }.  (TF repr)

(TF repr) is called a transfer function representation of the system B := B_i/o(G, Π). In terms of the parameters of the input/state/output representation B_i/s/o(A, B, C, D) = B_i/o(G), the transfer function is

G(z) = C(Iz − A)^(−1)B + D.  (TF ↔ I/S/O)

Define the matrix valued time series H ∈ (R^(p×m))^N by H := Z^(−1)(G), i.e.,

G(z) = H(0) + H(1)z^(−1) + H(2)z^(−2) + ··· .  (TF ↔ CONV)

The time series H is a parameter in an alternative, time-domain representation of the system B_i/o(G, Π). Let ⋆ be the convolution operator. Then

y(t) := (H ⋆ u)(t) = Σ_{τ=0}^{t−1} H(τ) u(t − τ).  (CONV eqn)

The system induced by H (with an input/output partition defined by Π) is

B_i/o(H, Π) := { w ∈ (R^w)^N | col(u, y) := Πw satisfies y = H ⋆ u }.  (CONV repr)

(CONV repr) is called a convolution representation of the system B := B_i/o(H, Π).

The matrices H(t), t ≥ 0, are called Markov parameters of the representation B_i/o(H). In terms of the parameters of the state space representation B_i/s/o(A, B, C, D) = B_i/o(H), the Markov parameters are

H(0) = D,  H(t) = CA^(t−1)B, t ≥ 1.  (CONV ↔ I/S/O)

In addition to the transfer function (TF repr) and convolution (CONV repr) representations, a controllable system B ∈ L^w allows an image representation [Wil91, Theorem V.3]; i.e., there is a polynomial matrix M ∈ R^(w×g)[z], such that B = image(M(σ)), where

image(M(σ)) := { w ∈ (R^w)^N | there is l ∈ (R^g)^N, such that w = M(σ)l }.  (IMG repr)

The image representation is minimal if the number g of latent variables is minimal; i.e., there are no more latent variables in the representation than necessary. The image representation image(M(σ)) of B is minimal if and only if M is of full column rank.
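As a small numerical illustration of the relation (CONV ↔ I/S/O) between the Markov parameters and the state space parameters, the following MATLAB sketch compares H(0) = D, H(t) = CA^(t−1)B with the response of the state space equations to a pulse input; the system used is hypothetical.

    % Sketch: Markov parameters H(0) = D, H(t) = C*A^(t-1)*B of a hypothetical
    % SISO system, compared with the response to the pulse input u = (1, 0, 0, ...)
    % obtained by simulating the state space equations directly.
    A = [0.5 1; 0 0.3]; B = [0; 1]; C = [1 0]; D = 0;   % example parameters
    t_max = 10;
    H = zeros(1, t_max+1);                 % H(k+1) stores the Markov parameter H(k)
    H(1) = D;
    for k = 1:t_max
        H(k+1) = C*A^(k-1)*B;
    end
    % pulse response, zero initial conditions
    x = zeros(2,1); y = zeros(1, t_max+1); u = [1 zeros(1, t_max)];
    for t = 1:t_max+1
        y(t) = C*x + D*u(t);
        x    = A*x + B*u(t);
    end
    disp(norm(H - y))                      % zero (up to round-off)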
7.7 Representation Theorem
The following theorem summarizes the results presented in the previous sections of this
chapter.
Theorem 7.3 (LTI system representations). The following statements are equivalent:

(i) B is a complete LTI system with w variables, m inputs, and p := w − m outputs, i.e., B ∈ L^w and m(B) = m;

(ii) there is a (full row rank) polynomial matrix R ∈ R^(p×w)[z], such that B = ker(R(σ));

(iii) there are polynomial matrices Q ∈ R^(p×m)[z] and P ∈ R^(p×p)[z], det(P) ≠ 0, P^(−1)Q proper, and a permutation matrix Π ∈ R^(w×w), such that B = B_i/o(P, Q, Π);

(iv) there is a natural number n, matrices A ∈ R^(n×n), B ∈ R^(n×m), C ∈ R^(p×n), and D ∈ R^(p×m), and a permutation matrix Π ∈ R^(w×w), such that B = B_i/s/o(A, B, C, D, Π);

(v) there is a natural number l ∈ N and polynomial matrices R ∈ R^(g×w)[z] and M ∈ R^(g×l)[z], such that B = B(R, M);

(vi) there are natural numbers n and v and matrices A′ ∈ R^(n×n), B′ ∈ R^(n×v), C′ ∈ R^(w×n), and D′ ∈ R^(w×v), such that B = B_ss(A′, B′, C′, D′).

If in addition B is controllable, then the following statement is equivalent to (i)–(vi):

(vii) there is a full column rank matrix M ∈ R^(w×m)[z], such that B = image(M(σ)).

A controllable system B has a transfer function representation B_i/o(G, Π) and a convolution representation B_i/o(H, Π). These representations are unique when an input/output partitioning of the variables is fixed.

The proofs of most of the implications of Theorem 7.3 can be found in [Wil86a] and [Wil91]. These proofs are constructive and give explicit algorithms for passing from one representation to another.
Figure 7.1 shows schematically the representations discussed up to now. To the left of the vertical line are representations that have no explicit input/output separation of the variables, and to the right of the vertical line are representations with an input/output separation of the variables. In the first row are state space representations. The representations below the second horizontal line exist only for controllable systems.

Transition from a latent variable representation to a representation without latent variables, for example B(R′, M′) → ker(R), involves elimination. Transition from a representation without an input/output separation to a representation with such a separation, for example ker(R) → B_i/o(P, Q), involves input/output selection. Transition from a representation in the second or third row to a representation in the first row is a realization problem.

In principle, all transitions from one type of representation to another are of interest (and imply algorithms that implement them). Moreover, all representations have special forms, such as the controller canonical form, the observer canonical form, balanced representations, etc. Making the graph in Figure 7.1 connected suffices in order to be able to derive any representation, starting from any other one. Having a specialized algorithm that does not derive intermediate representations, however, has advantages from a computational point of view.
[Figure 7.1 is a diagram with three rows. First row (state space representations): B_ss(A′, B′, C′, D′) and B_i/s/o(A, B, C, D). Second row: B(R′, M′) → (elimination) → ker(R) → (i/o selection) → B_i/o(P, Q). Third row (controllable systems only): image(M), B_i/o(G), and B_i/o(H). The right column contains the representations with an input/output separation.]

Figure 7.1. Representations by categories: state space, input/output, and controllable.
7.8 Parameterization of a Trajectory
A trajectory w of B ∈ L^w is parameterized by

1. a corresponding input u and
2. initial conditions x_ini.

If B is given in an input/state/output representation B = B_i/s/o(A, B, C, D), then an input u is given and the initial conditions can be chosen as the initial state x(1). The variation of constants formula

w = col(u, y),  y(t) = CA^(t−1) x_ini + Σ_{τ=1}^{t−1} CA^(t−τ−1)B u(τ) + Du(t),  t ≥ 1,  (VC)

where the terms CA^(t−τ−1)B under the sum are the Markov parameters H(t − τ), gives a parameterization of w. Note that the second term in the expression for y is the convolution of H and u. It alone gives the zero initial conditions response. The ith column of the impulse response H is the zero initial conditions response of the system to input u = e_i, where e_i is the ith unit vector.

For a given pair of matrices (A, B), A ∈ R^(n×n), B ∈ R^(n×m), and t ∈ N, define the extended controllability matrix (with t block columns)

C_t(A, B) := [ B  AB  ···  A^(t−1)B ]  (C)
and let C(A, B) := C_∞(A, B). The pair (A, B) is controllable if C(A, B) is of full row rank. By the Cayley–Hamilton theorem [Bro70, page 72], rank(C(A, B)) = rank(C_n(A, B)), so that it suffices to check the rank of the finite matrix C_n(A, B). The smallest natural number i for which C_i(A, B) is of full row rank is denoted by κ(A, B) and is called the controllability index of the pair (A, B). The controllability index is invariant under state transformation; i.e., κ(A, B) = κ(SAS^(−1), SB) for any nonsingular matrix S. In fact, κ(A, B) is an invariant of any system B_i/s/o(A, B, ·, ·), so that it is legitimate to use the notation κ(B) for B ∈ L^w. Clearly, κ(B) ≤ n(B).

Similarly, for a given pair of matrices (A, C), A ∈ R^(n×n), C ∈ R^(p×n), and t ∈ N, define the extended observability matrix (with t block rows)

O_t(A, C) := col(C, CA, ..., CA^(t−1))  (O)

and let O(A, C) := O_∞(A, C). The pair (A, C) is observable if O(A, C) is of full column rank. Again, rank(O(A, C)) = rank(O_n(A, C)), so that it suffices to check the rank of O_n(A, C). The smallest natural number i for which O_i(A, C) is of full column rank is denoted by ν(A, C) and is called the observability index of the pair (A, C). The observability index is invariant under state transformation; i.e., ν(A, C) = ν(SAS^(−1), CS^(−1)) for any nonsingular matrix S. In fact, ν(A, C) is equal to the lag of any system B_i/s/o(A, ·, C, ·), so that it is invariant and it is legitimate to use the notation ν(B) for B ∈ L^w. Clearly, l(B) = ν(B) ≤ n(B).

If the pairs (A, B) and (A, C) are understood from the context, they are skipped in the notation of the extended controllability and observability matrices.
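As a concrete illustration, the following MATLAB sketch builds the extended controllability and observability matrices of a hypothetical pair (A, B) and (A, C) and finds the indices, here denoted κ and ν as above, by increasing the number of blocks.

    % Sketch: extended controllability/observability matrices and the indices
    % kappa(A, B) and nu(A, C) for a hypothetical second order example.
    A = [0 1; -0.1 0.7]; B = [0; 1]; C = [1 0];
    n = size(A, 1);
    Ct = []; Ot = [];
    kappa = 0; nu = 0;
    for i = 1:n
        Ct = [Ct, A^(i-1)*B];              % C_i(A, B) = [B AB ... A^(i-1)B]
        Ot = [Ot; C*A^(i-1)];              % O_i(A, C) = col(C, CA, ..., CA^(i-1))
        if kappa == 0 && rank(Ct) == n, kappa = i; end
        if nu    == 0 && rank(Ot) == n, nu    = i; end
    end
    fprintf('controllability index = %d, observability index = %d\n', kappa, nu)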
We define also the lower triangular block-Toeplitz matrix

T_{t+1}(H) :=
[ H(0)
  H(1)   H(0)
  H(2)   H(1)     H(0)
   ⋮       ⋮        ⋮     ⋱
  H(t)   H(t−1)   ···    H(1)   H(0) ]   (T)

and let T(H) := T_∞(H). With this notation, equation (VC) can be written compactly as

col(u, y) = [ 0         I    ] [ x_ini ]
            [ O(A, C)  T(H)  ] [  u    ].   (VC′)
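The compact form (VC′) can be checked numerically. The following MATLAB sketch, for a hypothetical second order system, forms finite versions of O(A, C) and T(H) and compares the output computed from (VC′) with a direct simulation of the state space equations.

    % Sketch: verify (VC') for a hypothetical system over a finite horizon t.
    A = [0.8 0.1; 0 0.5]; B = [1; 1]; C = [1 -1]; D = 0.5;
    n = 2; t = 15;
    x_ini = [1; -1];
    u = randn(t, 1);                           % scalar input, t samples
    O = zeros(t, n); Tt = zeros(t, t);         % O_t(A, C) and T_t(H)
    for i = 1:t
        O(i, :) = C*A^(i-1);
        for j = 1:i
            if i == j, Tt(i, j) = D; else Tt(i, j) = C*A^(i-j-1)*B; end
        end
    end
    y_vc = O*x_ini + Tt*u;                     % the (VC') parameterization
    % direct simulation of sigma x = A x + B u, y = C x + D u, x(1) = x_ini
    x = x_ini; y = zeros(t, 1);
    for k = 1:t
        y(k) = C*x + D*u(k);
        x    = A*x + B*u(k);
    end
    disp(norm(y - y_vc))                       % zero (up to round-off)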
If the behavior B is not given by an input/state/output representation, then the parameterization of a trajectory w ∈ B is more involved. For example, in an input/output representation B = B_i/o(P, Q), w can be parameterized by the input u and the l = deg(P) values of the time series

w_ini := ( w(−l + 1), ..., w(0) )

preceding w, as follows:

y = O_i/o w_ini + T(H) u.  (VC i/o)

Here O_i/o is a matrix that induces a mapping from w_ini to the corresponding initial conditions response. Let B_i/s/o(A, B, C, D) = B_i/o(P, Q). Comparing (VC′) and (VC i/o), we see that the matrix O_i/o can be factored as O_i/o = O(A, C)X, where X is a matrix that induces the map w_ini ↦ x_ini, called a state map [RW97].

The graph in Figure 7.2 illustrates the two representations introduced in this section for a trajectory w of the system B ∈ L^{w,n}_m.
[Figure 7.2 is a diagram linking a trajectory w ∈ B ∈ L^{w,n}_m to its parameterization (w_ini, u) in an input/output representation B = B_i/o(P, Q, Π) and to its parameterization (x_ini, u) in an input/state/output representation B = B_i/s/o(A, B, C, D, Π).]

Figure 7.2. Links among w ∈ B ∈ L^{w,n}_m and its parameterizations in input/output and input/state/output form.
7.9 Complexity of a Linear Time-Invariant System
In Chapter 2, we introduced the complexity of a linear system B as the dimension of B as a subspace of the universum set. For an LTI system B ∈ L^w and for T ≥ l(B),

dim( B|_[1,T] ) = m(B)T + n(B),  (dim B)

which shows that the pair of natural numbers (m(B), n(B)) (the input cardinality and the total lag) specifies the complexity of the system. The model class L^{w,n}_m contains LTI systems of complexity bounded by the pair (m, n).

In the context of system identification problems, aiming at a kernel representation of the model, we need an alternative specification of the complexity by the input cardinality m(B) and the lag l(B). In general,

( l(B) − 1 ) p(B) < n(B) ≤ l(B) p(B),

so that

dim( B|_[1,T] ) ≤ m(B)T + l(B) p(B),

and the pair (m(B), l(B)) bounds the complexity of the system B.

The class of LTI systems with w variables, at most m inputs, and lag at most l is denoted by L^w_{m,l}.

This class specifies a set of LTI systems of a bounded complexity.
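The dimension formula (dim B) can be illustrated numerically: for a hypothetical minimal system with m = 1 input and order n = 2, stacking many random T samples long trajectories as columns gives a matrix whose rank is (generically) mT + n. The following MATLAB sketch performs this check.

    % Sketch: numerical check of dim(B|_[1,T]) = m*T + n for a hypothetical
    % minimal system with m = 1 and n = 2.
    A = [0.9 0.2; -0.1 0.6]; B = [1; 0]; C = [1 0]; D = 1;
    m = 1; p = 1; n = 2; T = 8; N = 50;        % N trajectories, N > m*T + n
    W = zeros((m + p)*T, N);
    for k = 1:N
        x = randn(n, 1); u = randn(T, 1); w = zeros((m + p)*T, 1);
        for t = 1:T
            y = C*x + D*u(t);
            w(2*t-1:2*t) = [u(t); y];          % w(t) = col(u(t), y(t))
            x = A*x + B*u(t);
        end
        W(:, k) = w;
    end
    fprintf('rank = %d, m*T + n = %d\n', rank(W), m*T + n)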
7.10 The Module of Annihilators of the Behavior

Define the set of annihilators of the system B ∈ L^w as

N_B := { r ∈ R^w[z] | r⊤(σ)B = 0 }

and the set of annihilators with length less than or equal to l as

N^l_B := { r ∈ N_B | deg(r) < l }.
The sets N_B and N^l_B are defined as subsets of R^w[z]. With some abuse of notation, we consider also the annihilators as vectors; i.e., for r(z) =: r_0 + r_1 z + ··· + r_l z^l ∈ N_B, we also write col(r_0, r_1, ..., r_l) ∈ N_B.

Lemma 7.4. Let r(z) = r_0 + r_1 z + ··· + r_{l−1} z^(l−1). Then r ∈ N^l_B if and only if

col⊤(r_0, r_1, ..., r_{l−1}) B|_[1,l] = 0.

The set of annihilators N_B is the dual B^⊥ of the behavior B.
The proofs of the following facts can be found in [Wil86a]. The structure of N_B is that of the module of R[z] generated by p polynomial vectors, say r^(1), ..., r^(p). The polynomial matrix R := [r^(1) ··· r^(p)]⊤ yields a kernel representation of the behavior B, i.e., B = ker(R(σ)).

Without loss of generality, assume that R is row proper; i.e., ker(R(σ)) is a shortest lag kernel representation. By the row properness of R, the set of annihilators N^l_B can be constructed from the r^(k)'s and their shifts:

N^l_B = image( r^(1)(z), z r^(1)(z), ..., z^(l−l_1−1) r^(1)(z) ; ... ; r^(p)(z), z r^(p)(z), ..., z^(l−l_p−1) r^(p)(z) ).

The dimension of N^l_B is (l − l_1) + (l − l_2) + ··· + (l − l_p) = pl − n.
In the proof of the fundamental lemma (see Appendix A.3), we need the following simple fact.

Lemma 7.5. Let r^(1), ..., r^(p), where deg(r^(i)) =: l_i, be independent over the ring of polynomials, and let l ≥ max_i l_i. Then the vectors

r^(1)(z), z r^(1)(z), ..., z^(l−l_1−1) r^(1)(z) ; ... ; r^(p)(z), z r^(p)(z), ..., z^(l−l_p−1) r^(p)(z)

are independent over the field of reals.
Chapter 8

Exact Identification
With this chapter, we start to consider identification problems. The first problem is the simplest of this type: given a trajectory of an LTI system, find a representation of the system that produced this trajectory. The problem is defined and motivated in Sections 8.1–8.3.

Exact identification is closely related to the construction of the most powerful unfalsified model (MPUM). In Section 8.2, we define the MPUM, and in Section 8.3, we define the identifiability property. Under identifiability, the MPUM of the data, which is explicitly constructible from the data, coincides with the data generating system. This allows us to find the data generating system from data. An identifiability test in terms of the given data is presented in Section 8.4. This key result is repeatedly used in what follows and is called the fundamental lemma.

In Section 8.5, we review algorithms for exact identification. Section 8.6 presents algorithms for passing from data to a convolution representation. Section 8.7 reviews realization theory and algorithms. Section 8.8 presents algorithms for computation of sequential free responses, which are a key ingredient of direct algorithms for the construction of an input/state/output representation of the MPUM.

In Section 8.9, we explain the relation of the algorithms presented to classical algorithms for deterministic subspace identification. In particular, the orthogonal and oblique projections correspond to computation of, respectively, free responses and sequential free responses of the system. We comment on the inherent inefficiency of the orthogonal and oblique projections for the purpose of exact identification. Simulation results that compare the efficiency of various exact identification algorithms are shown in Section 8.10.

8.1 Introduction

In this chapter, we consider the following problem:

Given a trajectory w_d of an LTI system B, find a representation of B.

We refer to this most basic identification problem as an exact identification problem. It is of interest to find algorithms that make the transition from w_d directly to any one of the various possible representations of B; cf. Figure 7.1.
[Figure 8.1 is a diagram linking the data w_d = (u_d, y_d) ∈ B (left) to the model representations B_i/s/o(A, B, C, D), B_i/o(G), and B_i/o(H) (right); its numbered arrows, including the realization arrow from B_i/o(H) to B_i/s/o(A, B, C, D), correspond to the transitions listed below.]

Figure 8.1. Data, input/output model representations, and links among them.
1. G(z) = C(Iz − A)^(−1)B + D
2. Realization of a transfer function
3. H = Z^(−1)(G)
4. G = Z(H) = Σ_{t=0}^∞ H(t) z^(−t)
5. Convolution y_d(t) = Σ_{τ=0}^{t−1} H(τ) u_d(t − τ)
6. Exact identification; see Algorithms 8.6 and 8.7
7. H(0) = D, H(t) = CA^(t−1)B, for t ≥ 1
8. Realization of an impulse response; see Algorithm 8.8
9. Simulation of the response under the input u_d
10. Exact identification; see Algorithm 8.1
11. Simulation of the response under the input u_d and initial conditions x(1) = x_ini
12. Exact identification; see Algorithms 8.4 and 8.5
Figure 8.1 shows the representations with an input/output partition of the variables that we considered before and the trajectory w_d =: (u_d, y_d). The transitions from w_d to convolution, transfer function, and input/state/output representations are exact identification problems. The transitions among the representations themselves are representation problems. Most notable of the representation problems are the realization ones: passing from an impulse response or a transfer function to an input/state/output representation.

The exact identification problem is an important system theoretic problem. It includes as a special case the classical impulse response realization problem and is a prerequisite for the study of more involved approximate, stochastic, and stochastic/approximate identification problems (e.g., the GlTLS misfit minimization problem, which is an approximate identification problem). In addition, numerical algorithms for exact identification are useful computational tools and appear as subproblems in other identification algorithms. By itself, however, exact identification is not a practical identification problem. The data is assumed to be exact, and unless B is the trivial system B = (R^w)^N, a randomly chosen time series w_d ∈ (R^w)^N is a trajectory of B with probability zero.

Modified exact identification algorithms can be applied to data that is not necessarily generated by a finite dimensional LTI system by replacing exact linear algebra operations with approximate operations. For example, rank determination is replaced by numerical rank determination (via the SVD), and the solution of a system of linear equations by LS or TLS approximation. A lot of research is devoted to the problem of establishing alternatives to w_d ∈ B under which such modified algorithms have desirable properties. Often this problem is treated in the stochastic setting of the ARMAX model, and the properties aimed at are consistency and asymptotic efficiency.

Note 8.1 (Multiple time series) In general, the given data for identification is a finite set of time series w_d,1, ..., w_d,N. In the presentation, however, we define and solve the identification problems for a single time series. The generalization for multiple time series of equal length is trivial, and the one for nonequal length is an open problem.

Note 8.2 (Finite amount of data) An important aspect of the identification problems that we address is the finiteness of the available data. Previous studies of exact identification either assume an infinite amount of data or do not address the issue of finiteness of the data.

Note 8.3 (Given input/output partitioning) Although the exact identification problem is defined in the behavioral setting, most of the established results are in the input/output setting. In our treatment, some problems are also solved in the input/output setting.

Software implementation of the algorithms presented in this and the following chapter is described in Appendix B.3.
8.2 The Most Powerful Unfalsified Model
The notion of the most powerful unfalsified model (MPUM) is introduced in [Wil86b, Definition 4]. It plays a fundamental role in the exact identification problem.

Definition 8.4 (MPUM in the model class L^w [Wil86b]). The system B ⊆ (R^w)^N is an MPUM of the time series w_d ∈ (R^w)^T in the model class L^w if it is

1. finite dimensional LTI, i.e., B ∈ L^w;
2. unfalsified, i.e., w_d ∈ B|_[1,T]; and
3. most powerful among all finite dimensional LTI unfalsified systems, i.e.,

B′ ∈ L^w and w_d ∈ B′|_[1,T]  ⟹  B|_[1,T] ⊆ B′|_[1,T].
The MPUM of w_d is denoted by B_mpum(w_d). We skip the explicit dependence on w_d when w_d is understood from the context.

The existence and uniqueness of the MPUM are proven in the following theorem.

Theorem 8.5 (Existence and uniqueness of the MPUM [Wil86b]). The MPUM of w_d ∈ (R^w)^T exists and is unique. Moreover,

B_mpum(w_d) = ∩ { B ∈ L^w | w_d ∈ B|_[1,T] };

i.e., B_mpum(w_d) is the smallest shift-invariant, closed in the topology of pointwise convergence, subspace of (R^w)^N that contains w_d.
Proof. Define B* := ∩ { B ∈ L^w | w_d ∈ B|_[1,T] }. We will show that B* is an MPUM.

Lemma 8.6 (Intersection property of L^w). B_1, B_2 ∈ L^w ⟹ B_1 ∩ B_2 ∈ L^w.

Proof. See [Wil86b, Proposition 11].

Lemma 8.6 implies that B* ∈ L^w. Obviously, w_d ∈ B*|_[1,T], so that B* is unfalsified. Moreover, B* is contained in every finite dimensional LTI unfalsified model, so that it is most powerful. Therefore, B* is an MPUM.

We proved the existence of an MPUM. In order to prove uniqueness, assume that there is B′ ≠ B* that is also an MPUM of w_d. By Lemma 8.6, B := B′ ∩ B* ∈ L^w, and B is obviously unfalsified. But B ⊂ B′, so that B′ is not an MPUM, which is a contradiction.
The next proposition shows another characterization of the MPUM for infinite w_d.

Proposition 8.7. Let w_d ∈ (R^w)^N. Then

B_mpum(w_d) = closure( image(w_d, σw_d, σ²w_d, ...) );

i.e., B_mpum(w_d) is the closure of the span of w_d and all its shifts.

Proof. Let B* := closure( image(w_d, σw_d, σ²w_d, ...) ). By definition, B* is a closed, linear, and shift-invariant subspace. Then [Wil86a, Theorem 5] implies that B* ∈ L^w. By definition, w_d ∈ B*, so that B* is unfalsified. From conditions 1 and 2 of Definition 8.4, it is easy to see that any unfalsified model contains B*. Therefore, B* is the MPUM of w_d.

Note 8.8 (Algorithms for construction of the MPUM) Proposition 8.7 shows that the MPUM B_mpum(w_d) is explicitly constructible from the given data w_d. However, algorithms that pass from w_d to concrete representations of B_mpum(w_d) are needed. Such algorithms are described in Section 8.5.
Note 8.9 (Generically B_mpum(w_d) = (R^w)^N for infinite data w_d ∈ (R^w)^N) The existence of the MPUM is guaranteed in the model class L^w of unbounded complexity. For rough data w_d ∈ (R^w)^N (the generic case in (R^w)^N), the MPUM is the trivial system B_mpum(w_d) = (R^w)^N, i.e., a system with w inputs. Therefore, generically the MPUM of an infinite time series does not exist in a model class L^w_m with m < w, and an approximation is needed in order to find a nontrivial model. Approximate identification is treated in Chapter 11.

Note 8.10 (Generically B_mpum(w_d)|_[1,T] = (R^w)^T for finite data w_d ∈ (R^w)^T) For finite data w_d ∈ (R^w)^T, the MPUM always exists in a model class L^w_m with any number 0 ≤ m ≤ w of inputs. For rough data the solution is still a trivial system, B_mpum(w_d)|_[1,T] = (R^w)^T. Now, however, the possibility of fitting an arbitrary T samples long time series is achieved by the initial conditions as well as the inputs. Indeed, any observable system B ∈ L^w of order n(B) ≥ p(B)T is unfalsified by any T samples long time series w_d ∈ (R^w)^T.
8.3 Identifiability
Not every trajectory w_d of a system B ∈ L^w allows the reconstruction of B from w_d. For example, the trajectory w_d = 0 ∈ B does not carry any information about B, because any LTI system is compatible with the zero trajectory. The possibility of identifying B from w_d is a property of both w_d and B. In order to formalize the notion of the possibility of identifying a system from exact data, we define the identifiability property as follows.

Definition 8.11 (Identifiability). The system B ⊆ (R^w)^N is identifiable from the data w_d ∈ (R^w)^T in the model class L^{w,n}_{m,l} if

1. B ∈ L^{w,n}_{m,l},
2. w_d ∈ B|_[1,T], and
3. there is no other system B′ ∈ L^{w,n}_{m,l}, B′ ≠ B, that fits the data, i.e.,

B′ ∈ L^{w,n}_{m,l} and w_d ∈ B′|_[1,T]  ⟹  B′ = B.
Identifiability in L^{w,n}_{m,l} implies that the MPUM of the data w_d is in L^{w,n}_{m,l} and coincides with the data generating system B.

Theorem 8.12. If B ⊆ (R^w)^N is identifiable from the data w_d ∈ (R^w)^T in the model class L^{w,n}_{m,l}, then B = B_mpum(w_d).

Proof. The first condition for B being identifiable from w_d implies the first condition for B being the MPUM of w_d, and the second conditions are equivalent. Condition 3 for B being identifiable from w_d implies that there is a unique unfalsified system in the model class L^{w,n}_{m,l}. Therefore, B is the MPUM of w_d.

Since the MPUM is explicitly computable from the given data (see Note 8.8), identifiability indeed implies the possibility of identifying the system from exact data. In Section 8.5, we list algorithms for passing from w_d to kernel, convolution, and input/state/output representations of the MPUM. For example, consider Algorithm 8.1, which constructs a kernel representation of the MPUM B_mpum(w_d).
Next, we define the considered exact identification problem.

Problem 8.13 (Exact identification). Given w_d ∈ B ∈ L^w and a complexity specification (m, l_max, n_max), determine whether B is identifiable from w_d in the model class L^{w,n_max}_{m,l_max}, and if so, find an algorithm that computes a representation of B.
8.4 Conditions for Identifiability
The block-Hankel matrix with t_1 block rows and t_2 block columns, constructed from the (in general matrix valued) time series w = (w(1), w(2), ...), is denoted by

H_{t_1,t_2}(w) :=
[ w(1)     w(2)       w(3)       ···  w(t_2)
  w(2)     w(3)       w(4)       ···  w(t_2+1)
  w(3)     w(4)       w(5)       ···  w(t_2+2)
    ⋮        ⋮          ⋮                ⋮
  w(t_1)   w(t_1+1)   w(t_1+2)   ···  w(t_1+t_2−1) ].   (H)

If both block dimensions t_1 and t_2 are infinite, we skip them in the notation; i.e., we define H(w) := H_{∞,∞}(w). If the time series is finite, w = (w(1), ..., w(T)), then H_{t_1}(w) denotes the Hankel matrix with t_1 block rows and as many block columns as the finite time horizon T allows; i.e., H_{t_1}(w) := H_{t_1,t_2}(w), where t_2 = T − t_1 + 1.

With some abuse of notation (w is viewed as both the matrix [w(1) w(2) ···] and the vector col(w(1), w(2), ...)), the infinite Hankel matrix H(w) can be block partitioned in the following two ways:

H(w) = col(w, σw, σ²w, ...) = [ w  σw  σ²w  ··· ],

which shows that it is composed of w and its shifts σ^t w, t ≥ 1, stacked next to each other. Therefore, w ∈ B implies that col span(H(w)) ⊆ B. We establish conditions on w and B under which equality holds, i.e., conditions under which w specifies B exactly.
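The construction of the block-Hankel matrix (H) is used repeatedly in the remainder of the chapter. The following MATLAB function is a possible implementation for a finite, vector valued time series stored columnwise; it is a sketch written for this exposition and is not necessarily the implementation described in Appendix B.3.

    function H = blkhank(w, i)
    % BLKHANK  Block-Hankel matrix H_i(w) with i block rows, built from the
    % finite time series w = (w(1), ..., w(T)), stored as a (dim w) x T matrix
    % (one column per time sample); see (H) for the definition.
    [nw, T] = size(w);
    j = T - i + 1;                       % number of block columns
    if j < 1, error('not enough data'); end
    H = zeros(i*nw, j);
    for k = 1:i
        H((k-1)*nw+1:k*nw, :) = w(:, k:k+j-1);
    end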
Definition 8.14 (Persistency of excitation). The time series u_d = (u_d(1), ..., u_d(T)) is persistently exciting of order L if the Hankel matrix H_L(u_d) is of full row rank.
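In MATLAB, the persistency of excitation condition can be verified directly from this definition; the sketch below assumes the blkhank sketch above is saved as blkhank.m and uses a hypothetical random input.

    % Sketch: check persistency of excitation of order L of an input sequence.
    T = 50; L = 5;
    u_d = randn(1, T);                         % a random scalar input sequence
    pe  = (rank(blkhank(u_d, L)) == L*size(u_d, 1));
    fprintf('persistently exciting of order %d: %d\n', L, pe)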
Lemma 8.15 (Fundamental lemma [WRMM05]). Let

1. w_d = (u_d, y_d) be a T samples long trajectory of the LTI system B, i.e.,

w_d = col(u_d, y_d) = ( col(u_d(1), y_d(1)), ..., col(u_d(T), y_d(T)) ) ∈ B|_[1,T];

2. the system B be controllable; and

3. the input sequence u_d be persistently exciting of order L + n(B).

Then any L samples long trajectory w = (u, y) of B can be written as a linear combination of the columns of H_L(w_d), and any linear combination H_L(w_d) g, g ∈ R^(T−L+1), is a trajectory of B, i.e.,

col span( H_L(w_d) ) = B|_[1,L].

Proof. See Appendix A.3.
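A simple numerical illustration of the fundamental lemma: for a hypothetical controllable SISO system of order n = 2 driven by a sufficiently exciting random input, rank(H_L(w_d)) equals dim(B|_[1,L]) = Lm + n(B), so that the column span of H_L(w_d) is indeed B|_[1,L]. The sketch assumes the blkhank function from above is available.

    % Sketch: rank(H_L(w_d)) = L*m + n for exact data from a hypothetical system.
    A = [1.2 -0.4; 1 0]; B = [1; 0]; C = [0.5 0.3]; D = 0;  % example system
    n = 2; m = 1; L = 4; T = 100;
    u_d = randn(1, T); x = zeros(n, 1); y_d = zeros(1, T);
    for t = 1:T
        y_d(t) = C*x + D*u_d(t);
        x      = A*x + B*u_d(t);
    end
    w_d = [u_d; y_d];                          % w_d(t) = col(u_d(t), y_d(t))
    fprintf('rank(H_L(w_d)) = %d, L*m + n = %d\n', rank(blkhank(w_d, L)), L*m + n)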
The fundamental lemma gives conditions under which the Hankel matrix H_L(w_d) has the correct image (and as a consequence the correct left kernel). For sufficiently large L, namely L ≥ l(B) + 1, it answers the identifiability question.

Theorem 8.16 (Identifiability conditions). The system B ∈ L^w is identifiable from the exact data w_d = (u_d, y_d) ∈ B if B is controllable and u_d is persistently exciting of order l(B) + 1 + n(B).

Note that for applying Theorem 8.16, we need to know a priori the order and the lag of B and that B is controllable. These assumptions can be relaxed as follows. Knowledge of upper bounds n_max and l_max of, respectively, n(B) and l(B) suffices to verify identifiability. Moreover, the condition "B controllable and u_d persistently exciting of order l_max + 1 + n_max" is the sharpest necessary condition for identifiability that is verifiable from the data, n_max, and l_max only. In other words, if u_d is not persistently exciting of order l_max + 1 + n_max, then there is a controllable system B ∈ L^{w,n_max}_{m,l_max}, such that w_d ∈ B and B is not identifiable from w_d.
We will need the following corollary of the fundamental lemma.

Corollary 8.17 (Willems et al. [WRMM05]). Consider the minimal input/state/output representation of the controllable system B, B_i/s/o(A, B, C, D), and let x_d be the state sequence of B_i/s/o(A, B, C, D) corresponding to the trajectory w_d = (u_d, y_d) of B.

(i) If u_d is persistently exciting of order n(B) + 1, then

rank( [x_d(1)  x_d(2)  ···  x_d(T)] ) = n(B)

and

rank( [ u_d(1) ··· u_d(T) ; x_d(1) ··· x_d(T) ] ) = n(B) + m.

(ii) If u_d is persistently exciting of order n(B) + L, then

rank( [ X_d ; H_L(u_d) ] ) = n(B) + Lm, where X_d := [ x_d(1)  ···  x_d(T − L + 1) ].

The rest of the chapter is devoted to the second part of the exact identification problem: algorithms that compute a representation of the MPUM.
8.5 Algorithms for Exact Identification
If the conditions of Theorem 8.16 are satisfied, then there are algorithms that compute a representation of the data generating system B from the data w_d. In fact, such algorithms compute the MPUM of the data w_d. In this section, we outline four classes of algorithms for exact identification. The first one derives a kernel representation and the second one derives a convolution representation. Composed with realization algorithms, they give (indirect) algorithms for the computation of state space representations. The last two classes of algorithms construct (directly) an input/state/output representation.
Algorithms for Computation of a Kernel Representation
Under the assumption of the fundamental lemma,

col span( H_{l_max+1}(w_d) ) = B|_[0,l_max].

Therefore, a basis for the left kernel of H_{l_max+1}(w_d) defines a kernel representation of B ∈ L^{w,n_max}_{m,l_max}. Let

[ R̃_0  R̃_1  ···  R̃_{l_max} ] H_{l_max+1}(w_d) = 0,

where R̃_i ∈ R^(g×w) with g = p(l_max + 1) − n(B). Then

B = ker( R̃(σ) ), where R̃(z) = Σ_{i=0}^{l_max} R̃_i z^i.

This (in general nonminimal) kernel representation can be made minimal by standard polynomial linear algebra algorithms: find a unimodular matrix Ũ ∈ R^(g×g)[z], such that Ũ R̃ = col(R, 0), where R is of full row rank. Then B = ker(R(σ)) is a minimal kernel representation.

The above procedure is summarized in Algorithm 8.1.
Note 8.18 (Approximate identification) The SVD in step 1 of Algorithm 8.1 is used for the computation of the left kernel of the block-Hankel matrix H_{l_max+1}(w_d). Other algorithms can be used for the same purpose as well. The SVD, however, has an important advantage when an approximate model is desired.

Suppose that rank(H_{l_max+1}(w_d)) = w(l_max + 1), so that B_mpum is the trivial model (R^w)^T. Nevertheless, one can proceed heuristically with steps 5 and 6 in order to compute a nontrivial approximate model. The parameter g can either be chosen from the decay of the singular values (e.g., the number of singular values smaller than a user-given tolerance) or be fixed. The selection of g determines the number of inputs of the identified model and thus its complexity. The motivation for this heuristic for approximate modeling is that U_2 spans a space that in a certain sense is an approximate left kernel of H_{l_max+1}(w_d).

In [Wil86b, Section 15], Algorithm 8.1 is refined. An efficient recursive algorithm for the computation of a kernel representation of the MPUM is proposed. Moreover, the algorithm of [Wil86b] computes a shortest lag kernel representation and as a byproduct finds an input/output partition of the variables.
Algorithm 8.1 Kernel representation of the MPUM  w2r
Input: w_d ∈ (R^w)^T and l_max.
1: Compute the SVD H_{l_max+1}(w_d) = UΣV⊤ and let r be the rank of H_{l_max+1}(w_d).
2: if r = w(l_max + 1) then
3:   R(z) = 0_{1×w} {the MPUM is the trivial model (R^w)^T}.
4: else
5:   Partition U =: [U_1  U_2], where U_1 has r columns and U_2 has g columns, and define U_2⊤ =: [R̃_0  R̃_1  ···  R̃_{l_max}], where R̃_i ∈ R^(g×w).
6:   Compute a unimodular matrix Ũ ∈ R^(g×g)[z], such that Ũ(z) ( Σ_{i=0}^{l_max} R̃_i z^i ) = col(R(z), 0), where R is of full row rank.
7: end if
Output: R(z), a minimal kernel representation of the MPUM.
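The following MATLAB sketch carries out steps 1 to 5 of Algorithm 8.1 on data simulated from a hypothetical system; step 6 (the unimodular reduction to a minimal kernel representation) requires polynomial matrix algebra and is omitted. The blkhank sketch from Section 8.4 is assumed to be available.

    % Sketch of steps 1-5 of Algorithm 8.1 (w2r): a (possibly nonminimal)
    % kernel representation of the MPUM via the SVD of the block-Hankel matrix.
    A = [1.2 -0.4; 1 0]; B = [1; 0]; C = [0.5 0.3]; D = 0;   % "true" system
    T = 100; l_max = 3; nw = 2;
    u_d = randn(1, T); x = zeros(2, 1); y_d = zeros(1, T);
    for t = 1:T, y_d(t) = C*x + D*u_d(t); x = A*x + B*u_d(t); end
    w_d = [u_d; y_d];
    H = blkhank(w_d, l_max + 1);
    [U, ~, ~] = svd(H);
    r = rank(H);                                   % numerical rank
    if r == nw*(l_max + 1)
        disp('MPUM is the trivial model (R^w)^T')
    else
        U2 = U(:, r+1:end);                        % basis for the left kernel
        Rt = U2';               % Rt = [R0~ R1~ ... Rlmax~], each block g x w
        fprintf('%d annihilators of length <= %d found\n', size(Rt, 1), l_max + 1)
    end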
Algorithm 8.2 is an indirect algorithm for computation of an input/state/output representation of the MPUM that uses Algorithm 8.1 for computing a kernel representation first. The transition from a kernel representation to an input/state/output representation is a standard one. First, a maximal-degree, full-rank submatrix P ∈ R^(p×p)[z] of R is selected, and Q is defined as the complementary to P submatrix of R. Then the left matrix fraction description (P, Q) is realized by standard realization algorithms.

Algorithm 8.2 I/S/O representation of the MPUM via a kernel representation  w2r2ss
Input: w_d ∈ (R^w)^T and l_max.
1: Compute a minimal kernel representation of the MPUM via Algorithm 8.1.
2: Select a maximal-degree, full-rank submatrix P ∈ R^(p×p)[z] of R and let Q be the complementary to P submatrix of R {select an input/output partition of the variables}.
3: Realize (P, Q) via a state space system B_i/s/o(A, B, C, D).
Output: (A, B, C, D), a minimal input/state/output representation of the MPUM.

If an input/output partition of the time series w_d is a priori given, then step 2 is skipped. For the computation of the transfer function P^(−1)(z)Q(z) of B, polynomial matrix operations are needed that are not an integral part of most popular numerical linear algebra packages and libraries such as MATLAB.
Algorithms for Computation of a Convolution Representation
The convolution representation is parameterized by the impulse response. Algorithm 8.7 from Section 8.6 computes the impulse response directly from data. This algorithm is a consequence of the fundamental lemma, with the refinement that sequential pieces of the impulse response are computed iteratively.

The impulse response is used in the algorithms for balanced model identification presented in Chapter 9. Previously proposed algorithms for balanced model identification compute a Hankel matrix of the Markov parameters and thus recompute most samples of the impulse response many times. The algorithm presented in Section 8.6 avoids this and as a result is more efficient.

Algorithm 8.3 is an indirect algorithm for computation of an input/state/output representation of the MPUM that uses Algorithm 8.7 for computing a convolution representation first. The transition from a convolution representation to an input/state/output representation is a standard problem of realization theory; see Section 8.7.

Algorithm 8.3 I/S/O representation of the MPUM via an impulse response  uy2h2ss
Input: u_d, y_d, n_max, and l_max.
1: Compute the first l_max + 1 + n_max samples of the impulse response H of the MPUM via Algorithm 8.7.
2: Compute a realization B_i/s/o(A, B, C, D) of H via Algorithm 8.8.
Output: (A, B, C, D), a minimal input/state/output representation of the MPUM.
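Algorithm 8.8 is not reproduced in this section. As a hedged stand-in, the following MATLAB sketch performs the same transition, from an exact impulse response to an input/state/output representation, by a standard SVD-based (Kung-type) realization; the impulse response is generated from a hypothetical SISO system, and the order n is assumed known.

    % Sketch: SVD-based realization of an impulse response (standard Kung-type
    % procedure, used here in place of Algorithm 8.8).
    A0 = [1.2 -0.4; 1 0]; B0 = [1; 0]; C0 = [0.5 0.3]; D0 = 0.2;
    t_max = 20; n = 2;
    H = zeros(1, t_max + 1); H(1) = D0;
    for k = 1:t_max, H(k+1) = C0*A0^(k-1)*B0; end
    % Hankel matrices of the Markov parameters H(1), H(2), ... (D excluded)
    i = 6; j = 6;                                 % i + j + 1 <= t_max + 1
    Hank  = hankel(H(2:i+1), H(i+1:i+j));
    Hank1 = hankel(H(3:i+2), H(i+2:i+j+1));       % shifted Hankel matrix
    [U, S, V] = svd(Hank);
    Un = U(:, 1:n); Sn = S(1:n, 1:n); Vn = V(:, 1:n);
    O  = Un*sqrtm(Sn);                            % extended observability matrix
    Cc = sqrtm(Sn)*Vn';                           % extended controllability matrix
    Ah = pinv(O)*Hank1*pinv(Cc);
    Bh = Cc(:, 1); Ch = O(1, :); Dh = H(1);
    % verify: Markov parameters of the realization match the given ones
    Hh = zeros(1, t_max + 1); Hh(1) = Dh;
    for k = 1:t_max, Hh(k+1) = Ch*Ah^(k-1)*Bh; end
    disp(norm(H - Hh))                            % small (up to round-off)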
Algorithms Based on Computation of an Observability Matrix
Let B = B_i/s/o(A, B, C, D). If, in addition to w_d = (u_d, y_d), the extended observability matrix O_{l_max+1}(A, C) were known, we could find (A, B, C, D) by solving two linear systems of equations. The first block row of O_{l_max+1}(A, C) immediately gives C, and A is computed from the so-called shift equation

( σ* O_{l_max+1}(A, C) ) A = σ ( O_{l_max+1}(A, C) ).

(σ and σ*, acting on a block matrix, remove, respectively, the first and the last block rows.) Once A and C are known, computing D, B, and the initial condition x_ini under which w_d is obtained is also a linear problem. The system of equations (see (VC))

y_d(t) = CA^(t−1) x_ini + Σ_{τ=1}^{t−1} CA^(t−τ−1) B u_d(τ) + D u_d(t), for t = 1, ..., l_max + 1,  (8.1)

is linear in the unknowns D, B, and x_ini and can be solved explicitly by using Kronecker products.

Thus the identification problem boils down to the computation of O_{l_max+1}(A, C). Observe that the columns of O_{l_max+1}(A, C) are n(B) linearly independent free responses of B. Moreover, any n(B) linearly independent free responses y_1, ..., y_{n(B)} of B, stacked next to each other, determine the extended observability matrix up to a similarity transformation. Let x_1, ..., x_{n(B)} be the initial conditions for y_1, ..., y_{n(B)}. The matrix

X_ini := [ x_1  ···  x_{n(B)} ] ∈ R^(n(B)×n(B))

is full rank because, by assumption, the corresponding responses are linearly independent. Then

Y_0 := [ y_1  ···  y_{n(B)} ] = O_{l_max+1}(A, C) X_ini,

which shows that Y_0 is equivalent to O_{l_max+1}(A, C).
We have further reduced the identification problem to the problem of computing n(B) linearly independent free responses of the MPUM. Under the assumptions of the fundamental lemma, such responses can be computed in the same way as the one used for the computation of the impulse response directly from data. The details are described in Section 8.8.

Since n(B) is unknown, however, n_max free responses y_1, ..., y_{n_max} are computed, such that the corresponding matrix Y_0 := [ y_1  ···  y_{n_max} ] has its maximal possible rank n(B). The matrix Y_0 in this case can be viewed as an extended observability matrix O_{l_max+1}(Ã, C̃) for a nonminimal input/state/output representation of B with Ã ∈ R^(n_max×n_max) and C̃ ∈ R^(p×n_max). In order to find a minimal representation, a rank revealing factorization Y_0 = Γ X_ini of Y_0 is computed. The matrix Γ is equal to O_{l_max+1}(A, C) up to a similarity transformation. The nonuniqueness of the state space basis in which Γ and X_ini are obtained corresponds precisely to the nonuniqueness of the rank revealing factorization.

The procedure outlined above is summarized in Algorithm 8.4. An alternative approach for computing a state sequence directly from data, based on the shift-and-cut map [WR02], is presented in [MWD05].
Algorithm 8.4 I/S/O representation of the MPUM via an observability matrix  uy2o2ss
Input: u_d, y_d, l_max, and n_max.
1: Compute n_max free responses Y_0 of the MPUM, each l_max + 1 samples long, via Algorithm 8.9.
2: Compute a rank revealing factorization Y_0 = Γ X_ini.
3: Solve the linear system of equations (σ* Γ) A = σ(Γ) for A and define C to be the first block entry of Γ.
4: Solve the linear system of equations (8.1) for D, B, and x_ini.
Output: (A, B, C, D), a minimal input/state/output representation of the MPUM.
Algorithms Based on Computation of a State Sequence
If a state sequence x_d(1), ..., x_d(n(B) + m + 1) of an input/state/output representation of the MPUM were known, then the parameters (A, B, C, D) could be computed by solving the linear system of equations

[ x_d(2)  ···  x_d(n(B)+m+1) ]   =   [ A  B ] [ x_d(1)  ···  x_d(n(B)+m) ]
[ y_d(1)  ···  y_d(n(B)+m)   ]       [ C  D ] [ u_d(1)  ···  u_d(n(B)+m) ].   (8.2)

Therefore, the identification problem is reduced to the problem of computing a state sequence of the MPUM. This can be done by computing n(B) + m + 1 sequential free responses. By sequential we mean that the corresponding sequence of initial conditions for the responses is a valid state sequence. Under the conditions of the fundamental lemma, such responses can be computed from data by an algorithm similar to the ones used for the computation of the impulse response and free responses. Since n(B) is unknown, however, n_max + m + 1 sequential free responses should be computed. The details are described in Section 8.8.

The procedure outlined above is summarized in Algorithm 8.5.
Algorithm 8.5 I/S/O representation of the MPUM via a state sequence  uy2x2ss
Input: u_d, y_d, l_max, and n_max.
1: Compute n_max + m + 1 sequential free responses Y_0 of the MPUM, each l_max + 1 samples long, via Algorithm 8.9.
2: Compute a rank revealing factorization Y_0 = Γ X_d.
3: Solve the system of equations (8.2) for A, B, C, D, where [ x_d(1)  ···  x_d(n_max + m + 1) ] := X_d.
Output: (A, B, C, D), a minimal input/state/output representation of the MPUM.
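The final step of Algorithm 8.5, solving (8.2), is a plain linear least squares problem. In the sketch below the state sequence X_d is taken from a simulation of a hypothetical system (in the algorithm it would instead come from the rank revealing factorization of the sequential free responses), and a few extra samples are used so that the system of equations is overdetermined.

    % Sketch: recover (A, B, C, D) from a state sequence and data via (8.2).
    A = [1.2 -0.4; 1 0]; B = [1; 0]; C = [0.5 0.3]; D = 0.2;
    n = 2; m = 1; N = 10;                           % N >= n + m samples
    u_d = randn(1, N); x = randn(n, 1);
    X_d = zeros(n, N+1); y_d = zeros(1, N);
    for t = 1:N
        X_d(:, t) = x;
        y_d(t)    = C*x + D*u_d(t);
        x         = A*x + B*u_d(t);
    end
    X_d(:, N+1) = x;
    % solve (8.2) in the least squares sense
    Lhs = [X_d(:, 2:N+1); y_d];                     % [x(2) ... x(N+1); y(1) ... y(N)]
    Rhs = [X_d(:, 1:N);   u_d];                     % [x(1) ... x(N);   u(1) ... u(N)]
    Par = Lhs/Rhs;                                  % Par = [A B; C D]
    disp(norm(Par - [A B; C D]))                    % zero (up to round-off)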
8.6 Computation of the Impulse Response from Data
In this section, we consider the following problem:

Given a trajectory w_d = (u_d, y_d) of a system B ∈ L^w, find the first t samples of the impulse response of B.

Under the conditions of the fundamental lemma, we have that

col span( H_t(w_d) ) = B|_[1,t].

This implies that there exists a matrix G, such that H_t(y_d)G = H. Thus the problem reduces to finding a particular G.

Define U_p, U_f, Y_p, and Y_f as follows:

H_{l_max+t}(u_d) =: col(U_p, U_f),   H_{l_max+t}(y_d) =: col(Y_p, Y_f),   (8.3)

where rowdim(U_p) = rowdim(Y_p) = l_max and rowdim(U_f) = rowdim(Y_f) = t.
Theorem 8.19 (Impulse response from data). Let w_d = (u_d, y_d) be a trajectory of a controllable LTI system B ∈ L^{w,n_max}_{m,l_max} and let u_d be persistently exciting of order t + l_max + n_max. Then the system of equations

col(U_p, U_f, Y_p) G = col( 0_{m·l_max × m}, col(I_m, 0_{m(t−1) × m}), 0_{p·l_max × m} )   (8.4)

is solvable for G. Moreover, for any particular solution G̃, the matrix Y_f G̃ contains the first t samples of the impulse response of B, i.e.,

Y_f G̃ = H.

Proof. Under the assumptions of the theorem, we can apply the fundamental lemma with L = l_max + t. Thus

col span( H_{l_max+t}(w_d) ) = B|_[1,l_max+t].
First, we show that (8.4) is solvable. The impulse response ( col(I_m, 0_{m(t−1)×m}), H ) is a (matrix valued) response of B obtained under zero initial conditions. Because of the zero initial conditions, ( col(I_m, 0_{m(t−1)×m}), H ) preceded by any number of zeros remains a response of B. Therefore, there exists a matrix G̃, such that

col(U_p, U_f, Y_p, Y_f) G̃ = col( 0_{m·l_max × m}, col(I_m, 0_{m(t−1)×m}), 0_{p·l_max × m}, H ).

This shows that there exists a solution G̃ of (8.4), and therefore Y_f G̃ is the impulse response.

Conversely, let G be a solution of (8.4). We have

col(U_p, U_f, Y_p, Y_f) G = col( 0_{m·l_max × m}, col(I_m, 0_{m(t−1)×m}), 0_{p·l_max × m}, Y_f G ),   (8.5)

and the fundamental lemma guarantees that the right-hand side of (8.5) is a response of B. The response is identically zero during the first l_max samples, which (using the assumption l_max ≥ l(B)) guarantees that the initial conditions are set to zero. The input col(I_m, 0_{m(t−1)×m}) is a matrix valued impulse, so that the corresponding output Y_f G is indeed the impulse response H.
Theorem 8.19 gives the following block algorithm for the computation of H.

Algorithm 8.6 Block computation of the impulse response from data  uy2hblk
Input: u_d, y_d, l_max, and t.
1: Solve the system of equations (8.4). Let G̃ be the computed solution.
2: Compute H = Y_f G̃.
Output: the first t samples of the impulse response H of the MPUM.
Note 8.20 (Efficient implementation via QR factorization) The system of equations (8.4) of step 1 of Algorithm 8.6 can be solved efficiently by first compressing the data via the QR factorization

   col(U_p, U_f, Y_p, Y_f)^T = QR,   R^T =: [ R_11  0  0 ; R_21  R_22  0 ],

where R_11 is j × j with j = m(l_max + t) + p l_max, and then computing the pseudoinverse of the R_11 block. We have

   H = Y_f ( col(U_p, U_f, Y_p) )^+ col(0, I, 0) = R_21 R_11^+ col(0, I, 0).
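To make the block computation concrete, the following is a minimal MATLAB sketch of Algorithm 8.6 for exact data. The function and variable names (uy2hblk, blkhank) are chosen for this illustration and are not the implementation supplementing the book; the plain pseudoinverse is used instead of the QR-based data compression of Note 8.20.

```matlab
function H = uy2hblk(u, y, lmax, t)
% Minimal sketch of Algorithm 8.6: first t samples of the impulse response
% from exact data u (m x T) and y (p x T), with lag bound lmax.
[m, ~] = size(u);  p = size(y, 1);
U = blkhank(u, lmax + t);   Y = blkhank(y, lmax + t);
Up = U(1:m*lmax, :);        Uf = U(m*lmax+1:end, :);    % "past" and "future" inputs
Yp = Y(1:p*lmax, :);        Yf = Y(p*lmax+1:end, :);    % "past" and "future" outputs
% Right-hand side of (8.4): zero past input, impulse future input, zero past output.
rhs = [zeros(m*lmax, m); eye(m); zeros(m*(t-1), m); zeros(p*lmax, m)];
G = pinv([Up; Uf; Yp]) * rhs;   % a particular solution of (8.4)
H = Yf * G;                     % stacked impulse response col(H(0), ..., H(t-1))
end

function H = blkhank(w, i)
% Block-Hankel matrix with i block rows built from the signal w (vars x T).
[nw, T] = size(w);
H = zeros(i*nw, T - i + 1);
for k = 1:i
    H((k-1)*nw+1:k*nw, :) = w(:, k:k+T-i);
end
end
```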
We proceed to point out an inherent limitation of Algorithm 8.6 when dealing with a finite amount of data. Let a T samples long trajectory be given. The persistency of excitation assumption in Theorem 8.19 requires that H_{t + l_max + n_max}(u_d) be full row rank, which implies that

   m(t + l_max + n_max) ≤ T − (t + l_max + n_max) + 1   ⟹   t ≤ (T + 1)/(m + 1) − l_max − n_max.

Thus, using Algorithm 8.6, we are limited in the number of samples of the impulse response that can be computed. Moreover, for efficiency and accuracy (in the presence of noise), we want to have Hankel matrices U_p, U_f, etc., with many more columns than rows, which implies small t.
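For example (taking the simulation setting of Section 8.10, where T = 500, m = 2, l_max = 2, and n_max = 4), the bound gives t ≤ 501/3 − 2 − 4 = 161, so at most 161 samples of the impulse response can be computed by the block algorithm from such a data record.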
In fact, according to Theorem 8.16, u_d persistently exciting of order 1 + l_max + n_max is sufficient for computation of the whole impulse response of the system. Indeed, this can be done by weaving trajectories. (See Figure 8.2 for an illustration.)
Lemma 8.21 (Weaving trajectories). Consider a system B in L^w and let

1. w_{d,1} be a T_1 samples long trajectory of B, i.e., w_{d,1} in B|_[1,T_1];
2. w_{d,2} be a T_2 samples long trajectory of B, i.e., w_{d,2} in B|_[1,T_2]; and
3. the last l_max samples, where l_max ≥ l(B), of w_{d,1} coincide with the first l_max samples of w_{d,2}, i.e.,

   ( w_{d,1}(T_1 − l_max + 1), . . . , w_{d,1}(T_1) ) = ( w_{d,2}(1), . . . , w_{d,2}(l_max) ).

Then the trajectory

   w := ( w_{d,1}(1), . . . , w_{d,1}(T_1), w_{d,2}(l_max + 1), . . . , w_{d,2}(T_2) )        (8.6)

obtained by weaving together w_{d,1} and w_{d,2} is a trajectory of B, i.e., w in B|_[1, T_1 + T_2 − l_max].
Proof. Let x_{d,1} := ( x_{d,1}(1), . . . , x_{d,1}(T_1 + 1) ) and x_{d,2} := ( x_{d,2}(1), . . . , x_{d,2}(T_2 + 1) ) be state sequences of B associated with w_{d,1} and w_{d,2}, respectively. Assumption 3 implies that x_{d,1}(T_1 + 1) = x_{d,2}(l_max + 1). Therefore, (8.6) is a trajectory of B.
Algorithm 8.7 overcomes the above-mentioned limitation of the block algorithm by iteratively computing blocks of L consecutive samples, where

   1 ≤ L ≤ (T + 1)/(m + 1) − l_max − n_max.        (8.7)

Moreover, monitoring the decay of H (provided the system is stable) while computing it gives a heuristic way to determine a value for t that is sufficiently large to show the transient. In the recursive algorithm, the matrices U_p, U_f, Y_p, and Y_f defined above are redefined as follows:

   H_{l_max + L}(u_d) =: col(U_p, U_f),   H_{l_max + L}(y_d) =: col(Y_p, Y_f),

where rowdim(U_p) = rowdim(Y_p) = l_max and rowdim(U_f) = rowdim(Y_f) = L.
Figure 8.2. Weaving trajectories.
Algorithm 8.7 Iterative computation of the impulse response from data uy2h
Input: u_d, y_d, n_max, l_max, and either t or a convergence tolerance ε.
1: Choose the number of samples L computed in one iteration step according to (8.7).
2: Initialization: k := 0, F_u^(0) := col( 0_{m l_max × m}, col(I_m, 0_{m(L−1) × m}) ), and F_{y,p}^(0) := 0_{p l_max × m}.
3: repeat
4:   Solve the system col(U_p, U_f, Y_p) G^(k) = col( F_u^(k), F_{y,p}^(k) ).
5:   Compute the response H^(k) := F_{y,f}^(k) := Y_f G^(k).
6:   Define F_y^(k) := col( F_{y,p}^(k), F_{y,f}^(k) ).
7:   Shift F_u and F_y: F_u^(k+1) := col( σ^L F_u^(k), 0_{mL × m} ), F_{y,p}^(k+1) := σ^L F_y^(k).
8:   Increment the iteration counter k := k + 1.
9: until kL ≥ t, if t is given, or ‖H^(k−1)‖_F ≤ ε otherwise.
Output: H = col( H^(0), . . . , H^(k−1) ).
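A minimal MATLAB sketch of the iterative scheme, for exact data and a given t, is shown below. It reuses the blkhank helper from the sketch after Note 8.20; the names are illustrative, not the book's implementation, and the stopping test is the simple sample-count criterion.

```matlab
function H = uy2h(u, y, nmax, lmax, t)
% Minimal sketch of Algorithm 8.7: iterative computation of the impulse response.
[m, T] = size(u);  p = size(y, 1);
L = max(1, min(t, floor((T+1)/(m+1)) - lmax - nmax));   % block length, cf. (8.7)
U = blkhank(u, lmax + L);   Y = blkhank(y, lmax + L);
Up = U(1:m*lmax, :);        Uf = U(m*lmax+1:end, :);
Yp = Y(1:p*lmax, :);        Yf = Y(p*lmax+1:end, :);
P  = pinv([Up; Uf; Yp]);                                 % computed once, reused (Note 8.24)
Fu  = [zeros(m*lmax, m); eye(m); zeros(m*(L-1), m)];     % F_u^(0)
Fyp = zeros(p*lmax, m);                                  % F_{y,p}^(0)
H = [];
while size(H, 1) < p*t
    G   = P * [Fu; Fyp];                                 % step 4
    Hk  = Yf * G;                                        % step 5: next L samples of H
    H   = [H; Hk];
    Fy  = [Fyp; Hk];                                     % step 6
    Fu  = [Fu(m*L+1:end, :); zeros(m*L, m)];             % step 7: shift and pad F_u
    Fyp = Fy(p*L+1:end, :);                              % step 7: shift F_y
end
H = H(1:p*t, :);                                         % col(H(0), ..., H(t-1))
end
```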
Proposition 8.22. Let w_d = (u_d, y_d) be a trajectory of a controllable LTI system B of order n(B) ≤ n_max and lag l(B) ≤ l_max, and let u_d be persistently exciting of order L + l_max + n_max. Then Algorithm 8.7 computes the first t samples of the impulse response of B.

Proof. Under the assumptions of the proposition, we can apply Theorem 8.19 with the parameter t replaced by the parameter L selected in step 1 of the algorithm. Steps 4 and 5 of the recursive algorithm correspond to steps 1 and 2 of the block algorithm (Algorithm 8.6).
The right-hand side col( F_u^(k), F_{y,p}^(k) ) of the system of equations solved in step 4 is initialized so that H^(0) is indeed the matrix of the first L samples of the impulse response. The response computed on the (k+1)st iteration step, k ≥ 1, is a response due to zero input and its first l_max samples overlap the last l_max samples of the response computed on the kth iteration step. By the weaving lemma (Lemma 8.21), their concatenation is a valid response. Applying this argument recursively, we have that H computed by the algorithm is the impulse response of the system.

With L = 1, the persistency of excitation condition required by Proposition 8.22 is l_max + 1 + n_max, which is the identifiability condition of Theorem 8.16 (with the unknown lag l(B) and order n(B) replaced by their given upper bounds l_max and n_max).
Note 8.23 (Data driven simulation) In [MWRM05], Theorem 8.19 and Algorithms 8.6 and 8.7 are modified to compute an arbitrary response directly from data. This procedure is called data driven simulation and is shown to be related to deterministic subspace identification algorithms.

Note 8.24 (Efficient implementation via QR factorization) The most expensive part of Algorithm 8.7 is solving the system of equations in step 4. It can be solved efficiently via the QR factorization, as described in Note 8.20. Moreover, since the matrix on the left-hand side of the system is fixed, the pseudoinverse can be computed outside the iteration loop and used for all iterations.
8.7 Realization Theory and Algorithms
The problem of passing from an impulse response to another representation (typically input/state/output or transfer function) is called realization. Given a sequence H : N → R^{p×m}, we say that a system B in L^w_m, w := m + p, realizes H if B has a convolution representation with an impulse response H. In this case, we say that H is realizable (by a system in the model class L^w_m). A sequence H might not be realizable by a finite dimensional LTI system, but if it is realizable, the realization is unique.

Theorem 8.25 (Test for realizability). The sequence H : N → R^{p×m} is realizable by a finite dimensional LTI system with m inputs if and only if the two-sided infinite Hankel matrix H(H) has a finite rank. Moreover, if the rank of H(H) is n, then there is a unique system B in L^{w,n}_m that realizes H.
Let H be realizable by a system B in L^w_m with an input/state/output representation B = B_i/s/o(A, B, C, D). We have that

   H_{i,j}(H) = O_i(A, C) C_j(A, B),

and from the properties of the controllability and observability matrices, it follows that rank ( H_{i,j}(H) ) = min(pi, mj) when the number of block rows i and block columns j are both smaller than, respectively, the observability and controllability indices of B, and rank ( H_{i,j}(H) ) = n(B) otherwise.
Therefore, if we know that H is an impulse response of a finite dimensional LTI system B of order n(B) ≤ n_max and lag l(B) ≤ l_max, where n_max and l_max are given, we can find n(B) by a rank computation as follows:

   n(B) = rank ( H_{l_max+1, n_max}(H) ).

This fact is often used in subspace identification. Moreover, the SVD H_{t,t}(H) = UΣV^T, t > n_max, allows us to find a finite time t balanced approximation of the MPUM, so that the numerical rank computation of the block-Hankel matrix of the Markov parameters is a good heuristic for approximate identification.
Note 8.26 (Realization and exact identification) Clearly, realization is a special exact identification problem. Realization of H : N → R^{p×m} is equivalent to exact identification of the time series

   w_{d,1} = (u_{d,1}, y_{d,1}) := ( col(0, e_1 δ), col(0, h_1) ),  . . . ,  w_{d,m} = (u_{d,m}, y_{d,m}) := ( col(0, e_m δ), col(0, h_m) ),

where [ h_1 · · · h_m ] := H, δ is the Kronecker delta function, [ e_1 · · · e_m ] := I_m, and the zero prefix is l_max samples long. (The zero prefix fixes the initial conditions to be zero, which otherwise are free in the exact identification problem.) Special purpose realization methods, however, are more efficient than a general exact identification algorithm.
Note 8.27 (Realization and exact identification of an autonomous system) An alternative point of view of realization is as an exact identification of an autonomous system: realization of H : N → R^{p×m} is equivalent to exact identification of the time series

   w_{d,1} = (u_{d,1}, y_{d,1}) := (0, h_1),  . . . ,  w_{d,m} = (u_{d,m}, y_{d,m}) := (0, h_m).

Consider the impulse response H of the system B_i/s/o(A, B, C, D) with B =: [ b_1 · · · b_m ], and the responses y_1, . . . , y_m of the autonomous system B_i/s/o(A, C) due to the initial conditions b_1, . . . , b_m. It is easy to verify that

   H = [ y_1  · · ·  y_m ].

Thus, with an obvious substitution,

   realization algorithms can be used for exact identification of an autonomous system and vice versa; algorithms for identification of an autonomous system can be used for realization.
Once we know from Theorem 8.25 or from prior knowledge that a given time series H := ( H(0), H(1), . . . , H(T) ) is realizable in the model class L^w_{m, l_max}, we can proceed with the problem of finding a representation of the system that realizes H. General exact identification algorithms can be used, but in the special case at hand there are more efficient alternatives. Algorithm 8.8 is a typical realization algorithm.
Algorithm 8.8 Realization algorithm h2ss
Input: H and l_max satisfying the conditions of Theorem 8.25.
1: Compute a rank revealing factorization of the Hankel matrix H_{l_max+1}(H) = Γ∆.
2: Let D = H(0), C be the first block row of Γ, and B be the first block column of ∆.
3: Solve the shift equation (Γ with its last block row removed) A = (Γ with its first block row removed).
Output: parameters (A, B, C, D) of a minimal input/state/output realization of H.
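The following MATLAB sketch illustrates Algorithm 8.8 with the SVD used as the (approximate) rank revealing factorization, i.e., the Kung-type variant discussed below. It assumes the Markov parameters are given as a p × m × N array with Hmp(:,:,k) = H(k−1); the names are illustrative and not the book's implementation.

```matlab
function [A, B, C, D] = h2ss(Hmp, lmax)
% Minimal sketch of Algorithm 8.8 (realization of an impulse response).
[p, m, N] = size(Hmp);
i = lmax + 1;  j = N - i;                   % block rows/columns of the Hankel matrix
Hh = zeros(p*i, m*j);
for r = 1:i
    for c = 1:j
        Hh((r-1)*p+1:r*p, (c-1)*m+1:c*m) = Hmp(:, :, r+c);   % block (r,c) = H(r+c-1)
    end
end
[U, S, V] = svd(Hh, 'econ');
n  = rank(Hh);                              % order of the realization
sq = sqrt(S(1:n, 1:n));
O  = U(:, 1:n) * sq;                        % extended observability matrix
Ct = sq * V(:, 1:n)';                       % extended controllability matrix
D  = Hmp(:, :, 1);                          % D = H(0)
C  = O(1:p, :);                             % first block row
B  = Ct(:, 1:m);                            % first block column
A  = O(1:end-p, :) \ O(p+1:end, :);         % shift equation
end
```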
The key computational step is the rank revealing factorization of the Hankel matrix H_{l_max+1}(H). Moreover, this step determines the state basis in which the parameters A, B, C, D are computed. In case of finite precision arithmetic, it is well known that rank computation is a nontrivial problem. The rank revealing factorization is crucial for the outcome of the algorithm because the rank of H_{l_max+1}(H) is the order of the realization.

When the given time series H is not realizable by an LTI system of order n_max := p l_max, i.e., H_{l_max+1}(H) is full rank, the SVD offers a possibility to find an approximate realization in the model class L^w_{m, l_max}; see also Note 8.18 on page 122. Replace the rank revealing factorization in step 1 of Algorithm 8.8 by the SVD H_{l_max+1}(H) = UΣV^T and the definitions Γ := U√Σ and ∆ := √Σ V^T. This can be viewed as computation of an approximate rank revealing factorization. Note that in this case the finite time controllability and observability gramians are equal,

   Γ^T Γ = ∆ ∆^T = Σ,

so that the computed realization B_i/s/o(A, B, C, D) is in a finite time l_max balanced form. Algorithm 8.8 with the above modification is Kung's algorithm [Kun78].
8.8 Computation of Free Responses
In this section, we consider the following problem:

   Given w_d = (u_d, y_d) in B, find (sequential) free responses Y_0 of B.

By sequential, we mean that the initial conditions corresponding to the columns of Y_0 form a valid state sequence of B.

First, we consider computation of general free responses. Using the fundamental lemma, a set of t samples long free responses can be computed from data as follows:

   col( H_t(u_d), H_t(y_d) ) G = col( 0, Y_0 ).        (8.8)

Therefore, for any G that satisfies H_t(u_d) G = 0, the columns of Y_0 := H_t(y_d) G are free responses. The columns of G are vectors in the null space of H_t(u_d) and can be computed explicitly; however, in general, rank(Y_0) ≤ n(B). The condition rank(Y_0) = n(B) is needed for identification of an input/state/output representation of the MPUM, as outlined in Algorithm 8.3.

In order to ensure the rank condition, we use the splitting of the data into past and future as defined in (8.3). The blocks in the past allow us to restrict the matrix G, so that
the initial conditions X_ini under which the responses Y_0 are generated satisfy rank(X_ini) = n(B). This implies rank(Y_0) = n(B). It turns out, however, that in choosing the initial conditions X_ini, we can furthermore produce sequential free responses.

Using the fundamental lemma, we know that the right-hand side of the equation

   col(U_p, U_f, Y_p, Y_f) G = col(U_p, 0, Y_p, Y_0)

is a trajectory. Therefore, a set of free responses can be computed from data by solving the system of equations

   col(U_p, U_f, Y_p) G = col(U_p, 0, Y_p)        (8.9)

and setting Y_0 = Y_f G. Moreover, the Hankel structure of U_p and Y_p implies that Y_0 is a matrix of sequential responses. System (8.9) and Y_0 = Y_f G give a block algorithm for the computation of sequential free responses. It is analogous to the block algorithm for the computation of the impulse response and again the computation can be performed efficiently via the QR factorization.
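In MATLAB-like terms the block computation amounts to the following lines (a sketch for exact data, with Up, Uf, Yp, Yf formed from H_{l_max + t}(u_d) and H_{l_max + t}(y_d) as in (8.3), e.g., with the blkhank helper of the earlier sketches):

```matlab
% Sequential free responses via (8.9): zero future input, past data fixed.
G  = pinv([Up; Uf; Yp]) * [Up; zeros(size(Uf)); Yp];
Y0 = Yf * G;    % columns are t samples long sequential free responses
```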
We proceed to present a recursive algorithm for the computation of Y_0, analogous to Algorithm 8.7 for the computation of the impulse response. An advantage of the recursive algorithm over the block one is that one is not restricted by the finite amount of data w_d to finite length responses Y_0.

Proposition 8.28. Under the assumptions of Proposition 8.22, Algorithm 8.9 computes a matrix of sequential free responses of B with t block rows.

Proof. This is similar to the proof of Proposition 8.22.
8.9 Relation to Subspace Identification Methods

MOESP-Type Algorithms
The multivariable output error state space (MOESP)-type subspace identification algorithms correspond to the algorithm based on the computation of free responses as outlined in Section 8.5, Algorithm 8.4. However, in the MOESP algorithms, step 1, the computation of free responses, is implemented via the orthogonal projection

   Y_0 := H_{l_max+1}(y_d) ( I − H_{l_max+1}^T(u_d) ( H_{l_max+1}(u_d) H_{l_max+1}^T(u_d) )^{-1} H_{l_max+1}(u_d) ),        (8.10)

where the matrix in parentheses is the projector denoted Π_{u_d}^⊥; i.e., the MOESP algorithms compute the orthogonal projection of the rows of H_{l_max+1}(y_d) on the orthogonal complement of the row space of H_{l_max+1}(u_d). In subspace identification it is customary to think in terms of geometric operations: projection of the rows of a certain
Algorithm 8.9 Iterative computation of sequential free responses uy2y0
Input: u_d, y_d, n_max, l_max, and either the desired number of samples t or a convergence tolerance ε.
1: Choose the number of samples L computed in one iteration step according to (8.7).
2: Initialization: k := 0, F_u^(0) := col(U_p, 0), and F_{y,p}^(0) := Y_p.
3: repeat
4:   Solve the system col(U_p, U_f, Y_p) G^(k) = col( F_u^(k), F_{y,p}^(k) ).
5:   Compute the response Y_0^(k) := F_{y,f}^(k) := Y_f G^(k).
6:   Define F_y^(k) := col( F_{y,p}^(k), F_{y,f}^(k) ).
7:   Shift F_u and F_y: F_u^(k+1) := col( σ^L F_u^(k), 0 ) and F_{y,p}^(k+1) := σ^L F_y^(k).
8:   Increment the iteration counter k := k + 1.
9: until kL ≥ t, if t is given, or ‖Y_0^(k−1)‖_F ≤ ε otherwise.
Output: Y_0 = col( Y_0^(0), . . . , Y_0^(k−1) ).
matrix onto the row space of another matrix. The fact that these matrices have special (block-Hankel) structure is ignored and the link with system theory is lost. Still, as we show next,

   the orthogonal projection (8.10) has the simple and useful system theoretic interpretation of computing a maximal number of free responses.

Observe that

   col( H_{l_max+1}(u_d), H_{l_max+1}(y_d) ) Π_{u_d}^⊥ = col( 0, Y_0 ),

which corresponds to (8.8) except that now the projector Π_{u_d}^⊥ is a square matrix, while in (8.8) G is in general a rectangular matrix. In [VD92, Section 3.3], it is shown that a sufficient condition for rank(Y_0) = n(B) is

   rank ( col( X_ini, H_{l_max+1}(u_d) ) ) = n(B) + (l_max + 1)m.        (8.11)

This condition, however, is not verifiable from the data w_d = (u_d, y_d). Therefore, given w_d, one cannot check in general whether the data generating system B is identifiable by the MOESP algorithms. Under the identifiability condition

   u_d persistently exciting of order l_max + 1 + n_max,

which is verifiable from the data, Corollary 8.17 implies (8.11).
Finally, note that the j = T − l_max free responses that the orthogonal projection (8.10) computes are typically more than necessary for exact identification, i.e., j ≫ n(B). Therefore, in general, the orthogonal projection is a computationally inefficient operation for exact identification. This deficiency of the MOESP algorithms is partially corrected on the level of the numerical implementation. First, the QR factorization

   col( H_{n_max}(u_d), H_{n_max}(y_d) )^T = QR

is computed and then only the block entry R_22 of the R factor is used, where

   R^T =: [ R_11  0  0 ; R_21  R_22  0 ],

with R_11 of dimension n_max m × n_max m and R_22 of dimension n_max p × n_max p. It can be shown (see [VD92, Section 4.1]) that

   col span(Y_0) = col span(R_22).

The column dimension of R_22 is n_max p, which is (typically) comparable with n_max and is (typically) much smaller than j.
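For illustration, the orthogonal projection (8.10) can be written in MATLAB as follows (a sketch assuming Hu := H_{l_max+1}(u_d) and Hy := H_{l_max+1}(y_d) have been formed, e.g., with the blkhank helper above, and Hu has full row rank):

```matlab
% Orthogonal projection (8.10): project the rows of Hy onto the orthogonal
% complement of the row space of Hu.
Piu = eye(size(Hu, 2)) - Hu' * ((Hu * Hu') \ Hu);
Y0  = Hy * Piu;    % j = T - lmax free responses (columns)
```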
N4SID-Type Algorithms
The numerical algorithms for subspace state space system identification (N4SID) correspond to the algorithm based on the computation of a state sequence as outlined in Section 8.5, Algorithm 8.5. However, in the N4SID-type algorithms, step 1, the computation of sequential free responses, is implemented via the oblique projection. Consider the splitting of the data into past and future,

   H_{2(l_max+1)}(u_d) =: col(U_p, U_f),   H_{2(l_max+1)}(y_d) =: col(Y_p, Y_f),        (8.12)

with rowdim(U_p) = rowdim(U_f) = rowdim(Y_p) = rowdim(Y_f) = l_max + 1, and let

   W_p := col(U_p, Y_p).

As the key computational step of the MOESP algorithms is the orthogonal projection, the key computational step of the N4SID algorithms is the oblique projection of Y_f along the space spanned by the rows of U_f onto the space spanned by the rows of W_p. This geometric operation, denoted by Y_f /_{U_f} W_p, is defined as follows (see [VD96, equation (1.4), page 21]):
   Y_0 := Y_f /_{U_f} W_p := Y_f [ W_p^T  U_f^T ] ( [ W_p W_p^T   W_p U_f^T ; U_f W_p^T   U_f U_f^T ] )^+ col(W_p, 0),        (8.13)

where Π_obl denotes the product of the three factors following Y_f.

Next, we show that
   the oblique projection computes sequential free responses of the system.

Note that

   col( W_p, U_f, Y_f ) Π_obl = col( W_p, 0, Y_0 )

corresponds to (8.9) except that the oblique projector Π_obl is a square matrix, while in (8.9), G is in general rectangular. Therefore, the columns of the oblique projection Y_0 given in (8.13) are j := T − 2l_max − 1 sequential free responses. However, as with the orthogonal projection, the oblique projection also computes in general more responses than the n_max + m + 2 ones needed for applying Algorithm 8.5.
In [VD96, Section 2, Theorem 2], it is (implicitly) proven that a sufficient condition for rank(X_d) = n(B), which is needed for the exact identification Algorithm 8.5, is

1. u_d persistently exciting of order 2n_max and
2. row span(X_d) ∩ row span(U_f) = {0};

see assumptions 1 and 2 of [VD96, Section 2, Theorem 2]. As with assumption (8.11) in the MOESP algorithms, however, assumption 2 is again not verifiable from the given data. Persistency of excitation of u_d of order 2(l_max + 1) + n(B) (i.e., the assumption of the fundamental lemma) is a sufficient condition, verifiable from the data (u_d, y_d), for assumptions 1 and 2 of [VD96, Section 2, Theorem 2].
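A direct MATLAB transcription of the oblique projection (8.13) is given below (a sketch; Up, Uf, Yp, Yf are formed from H_{2(l_max+1)}(u_d) and H_{2(l_max+1)}(y_d) as in (8.12)):

```matlab
% Oblique projection of Yf along the row space of Uf onto the row space of Wp, cf. (8.13).
Wp = [Up; Yp];
Pi = [Wp; Uf];
Y0 = Yf * Pi' * pinv(Pi * Pi') * [Wp; zeros(size(Uf, 1), size(Wp, 2))];
```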
8.10 Simulation Examples

Impulse Response Computation
First we consider the problem of computing the first t samples of the impulse response H of a system B from data w_d := (u_d, y_d). We choose a random stable system B of order n = 4 with m = 2 inputs and p = 2 outputs. The data w_d is obtained according to the EIV model w_d = w̄ + w̃, where w̄ := (ū, ȳ) in B|_[1,T] with T = 500, ū is zero mean unit variance white Gaussian noise, and w̃ is zero mean white Gaussian noise with variance σ². Varying σ, we study empirically the effect of random perturbation on the results.

We apply Algorithm 8.6 with t = 27, n_max = n, and l_max = n_max/p. The computed impulse response is denoted by Ĥ and is compared with the true impulse response H obtained from B by simulation. The comparison is in terms of the Frobenius norm e = ‖H − Ĥ‖_F of the approximation error H − Ĥ. We also apply Algorithm 8.7 with parameters n_max = n, l_max = n_max/p, L = 12, and the function impulse from the System Identification Toolbox of MATLAB that estimates the impulse response from data. Table 8.1 shows the approximation errors e and execution times for four different noise levels and for the three compared algorithms. (The efficiency is measured by the execution time and not by the floating point operations (flops) because the function impulse is available only in the later versions of MATLAB that do not support flop counts.)

In the absence of noise, both Algorithm 8.6 and Algorithm 8.7 compute up to numerical errors exactly the impulse response H, while the same is not true for the function impulse. The simulation results show that the iterative algorithm is faster than the block algorithm.
Table 8.1. Error of approximation e = ‖H − Ĥ‖_F and execution time in seconds for Algorithm 8.6, Algorithm 8.7 with L = 12, and the function impulse.

                  σ = 0.0             σ = 0.01            σ = 0.05            σ = 0.1
Method            e         time, s   e        time, s    e        time, s    e        time, s
Algorithm 8.6     10^-14    0.293     0.029    0.277      0.096    0.285      0.251    0.279
Algorithm 8.7     10^-14    0.066     0.023    0.086      0.066    0.068      0.201    0.087
impulse           0.059     0.584     0.067    0.546      0.109    0.573      0.249    0.558
Figure 8.3. Number of flops (in megaflops) as a function of the parameter L.
Also, when the given data w_d is noisy, the iterative algorithm outperforms the block algorithm and the function impulse.

Next, we show the effect of the parameter L on the number of flops and the error of approximation e. The plot in Figure 8.3 shows the number of flops, measured in megaflops, as a function of L. The function is monotonically increasing, so that the computation is most efficient for L = 1. The plots in Figure 8.4 show the approximation error e as a function of L for four different noise levels. The results are averaged over 100 noise realizations. The function e(L) is complicated and is likely to depend on many factors. The graphs, however, indicate that in the presence of noise, there is a trade-off between computational efficiency and approximation error. For small L the computational cost is small, but the error e tends to be large.
Comparison of Exact Identification Algorithms
We compare the numerical efficiency of the following algorithms for deterministic identification:

uy2ssmr    Algorithm of Moonen and Ramos [MR93]; see Algorithm 9.6;
uy2ssvd    Algorithm of Van Overschee and De Moor [VD96]; see Algorithm 9.5. Deterministic algorithm 1 of Section 2.4.1 in [VD96] is combined with the choice of the weight matrices W_1 and W_2 given in Theorem 13, Section 5.4.1. Our implementation, however, differs from the outline of the algorithms given in [VD96]; see Note 9.4 on page 145;
Figure 8.4. Error of approximation e = ‖H − Ĥ‖_F as a function of the parameter L for different noise levels σ.
det_stat   Deterministic algorithm 1 of [VD96, Section 2.4.1] (implementation det_stat.m supplementing the book);
det_alt    Deterministic algorithm 2 of [VD96, Section 2.4.2] (implementation det_alt.m supplementing the book);
projec     Projection algorithm of [VD96, Section 2.3.1] (implementation projec.m supplementing the book);
intersec   Intersection algorithm of [VD96, Section 2.3.2] (implementation intersec.m supplementing the book);
moesp      A deterministic version of the MOESP algorithm;
uy2ssbal   The algorithm for deterministic balanced subspace identification proposed in Chapter 9 (with parameter L = 1); see Algorithm 9.4;
uy2h2ss    Algorithm 8.7 (with L = 1) applied for the computation of the first 2n_max + 1 samples of the impulse response, followed by Kung's algorithm for the realization of the impulse response; see Algorithm 8.3.
For the experiments we generate a random stable nth order system B with m = 2 inputs and p = 2 outputs. The input is a T samples long, zero mean unit variance white Gaussian sequence and the initial condition x_ini is a zero mean random vector. We assume that the
Figure 8.5. Number of flops (in megaflops) for the compared algorithms as a function of the length T of the given time series (n = 4).
true order is known; i.e., n_max = n and l_max is selected as n_max/p. The parameter i (the number of block rows of the Hankel matrix constructed from data) in the subspace identification algorithms is selected as i = n_max/p.
First, we illustrate the amount of work (measured in megaflops) for the compared algorithms as a function of T; see Figure 8.5. The order is chosen as n = 4 and T varies from 100 to 500. The computational complexity of all compared algorithms is linear in T but with different initial cost and different slope. The initial cost and the slope are smallest (and almost the same) for uy2ssbal and uy2h2ss.

The second experiment shows the flops for the compared algorithms as a function of the system order n; see Figure 8.6. The length T of the given time series is chosen as 50 and the order n is varied from 1 to 18. We deliberately choose T small to show the limitations of the algorithms to identify a system from a finite amount of data. At a certain value of n, the graphs in Figure 8.6 stop. The value of n where a graph stops is the highest possible order of a system that the corresponding algorithm can identify from the given T = 50 data points. (At higher values of n, the algorithm either exits with an error message or gives a wrong result.) The flops as a function of n are quadratic for all compared algorithms but again the actual number of flops depends on the implementation. Again, most efficient are uy2ssbal and uy2h2ss. Also, they outperform all other methods except moesp in the ability to identify a (high order) system from a (small) amount of data. This is a consequence of the fact that Algorithm 8.7 is more parsimonious in the persistency of excitation assumption than Algorithm 8.6.
8.11 Conclusions
We have presented theory and algorithms for exact identification of LTI systems. Although the exact identification problem is not a realistic identification problem, it is interesting and nontrivial from theoretic and algorithmic points of view. In addition, it is an ingredient and
Figure 8.6. Number of flops (in megaflops) for the compared algorithms as a function of the order n of the system (T = 50).
prerequisite for proper understanding of other more complicated and realistic identification problems incorporating uncertainty.

The main result is the answer to the identifiability question: Under what conditions, verifiable from w_d, does the MPUM coincide with the data generating system? Once this question is answered positively, one can consider algorithms for passing from the data to a representation of the unknown system. In fact, the algorithms compute a representation of the MPUM.

We have presented algorithms for exact identification aiming at kernel, convolution, and input/state/output representations. The latter ones were analyzed in the most detail. We showed links and a new interpretation of the classical MOESP and N4SID deterministic subspace identification algorithms.
Chapter 9
Balanced Model Identification

In this chapter, algorithms for identification of a balanced state space representation are considered. They are based on the algorithms for computation of the impulse response and sequential zero input responses presented in Chapter 8. The proposed algorithms are more efficient than the existing alternatives that compute the whole Hankel matrix of Markov parameters. Moreover, using a finite amount of data, the existing algorithms compute a finite time balanced representation, and the identified models have a lower bound on the distance from an exact balanced representation. The proposed algorithms can approximate arbitrarily closely an exact balanced representation. The finite time balancing parameter can be selected automatically by monitoring the decay of the impulse response. We show which partitioning of the data into past and future is optimal in terms of the minimal identifiability condition for deterministic subspace identification.
9.1 Introduction
In this chapter, we consider the following deterministic identification problem:

   Given a T samples long input/output trajectory w_d = (u_d, y_d) of an LTI system B in L^{w,n_max}_{m,l_max}, determine a balanced input/state/output representation

      B = B_i/s/o(A_bal, B_bal, C_bal, D_bal)

   of the system, i.e., a representation such that

      O^T(A_bal, C_bal) O(A_bal, C_bal) = C(A_bal, B_bal) C^T(A_bal, B_bal) = Σ,

   where Σ = diag(σ_1, . . . , σ_{n(B)}) and σ_1 ≥ σ_2 ≥ · · · ≥ σ_{n(B)}.
The problem is to find conditions and algorithms to construct (A_bal, B_bal, C_bal, D_bal) directly from w_d. Equivalently, we want to find a balanced input/state/output representation of the MPUM.
Algorithm 9.1 Balanced identification via a state sequence uy2ssbal
Input: u_d, y_d, n_max, l_max, and ∆ > n_max.
1: Compute the first 2∆ samples H of the impulse response matrix of B.
2: Compute n_max + m + 1 sequential free responses Y_0 of B, each ∆ samples long.
3: Compute the SVD, H = UΣV^T, of the block-Hankel matrix H = H_∆(H).
4: Compute the balanced state sequence X_bal := (√Σ)^{-1} U^T Y_0,

   X_bal = [ x_bal(n_max + 1)  · · ·  x_bal(2n_max + 2 + m) ].

5: Compute the balanced realization A_bal, B_bal, C_bal, D_bal by solving the linear system of equations

   [ x_bal(n_max + 2)  · · ·  x_bal(2n_max + 2 + m) ]     [ A_bal  B_bal ] [ x_bal(n_max + 1)  · · ·  x_bal(2n_max + 1 + m) ]
   [ y_d(n_max + 1)    · · ·  y_d(2n_max + 1 + m)   ]  =  [ C_bal  D_bal ] [ u_d(n_max + 1)    · · ·  u_d(2n_max + 1 + m)   ].        (9.1)

Output: A_bal, B_bal, C_bal, D_bal.
Although the assumption that w_d is exact is mainly of theoretical importance, solving the exact identification problem is a prerequisite for the study of the realistic approximate identification problem, where w_d is approximated by a trajectory ŵ of an LTI system. In a balanced basis, one can apply truncation as a very effective heuristic for model reduction, which yields a method for approximate identification.

The balanced state space identification problem is studied in [MR93] and [VD96, Chapter 5]. The proposed algorithms fit the outline of Algorithm 9.1.

In [MR93, VD96], it is not mentioned that the Hankel matrix of Markov parameters H_∆(H) is computed. Also, in [MR93], it is not mentioned that the matrix Y_0 of sequential zero input responses is computed. In this chapter, we interpret these algorithms as implementations of Algorithm 9.1 and reveal their structure.

Note 9.1 (Finite time-∆ balancing) The basic algorithm factors a finite block-Hankel matrix of Markov parameters H, so that the obtained representation (A_bal, B_bal, C_bal, D_bal) is finite time-∆ balanced. For large ∆, the representation obtained is close to an infinite time balanced one. Determining an appropriate value for the parameter ∆, however, is a problem in its own right and is addressed here. The important difference among the algorithms of [MR93], [VD96], and the ones proposed here is the method of computing the matrix Y_0 and the impulse response H.

Note 9.2 (Model reduction) Identification of a state space model in a balanced basis is motivated by the effective heuristic for model reduction by truncation in that basis. In principle it is possible to identify the model in any basis and then apply standard algorithms for state transformation to a balanced basis. The direct algorithms discussed in this chapter, however, have the advantage over the indirect approach that they allow us to identify a reduced order model directly from data without ever computing a full order model.
Algorithm 9.2 Balanced identification via the impulse response uy2h2ss
Input: u_d, y_d, n_max, and ∆ > n_max.
1: Find the first 2∆ samples H(0), . . . , H(2∆ − 1) of the impulse response of B and let H := col( H(0), . . . , H(2∆ − 1) ).
2: Compute the SVD, H = UΣV^T, of the block-Hankel matrix of Markov parameters H = H_∆(H) in R^{∆p × ∆m}.
3: Define O_bal := U√Σ and C_bal := √Σ V^T.
4: Let D_bal = H(0), B_bal be equal to the first m columns of C_bal (the first block column), C_bal be equal to the first p rows of O_bal (the first block row), and A_bal be the solution of the shift equation (O_bal with its last block row removed) A_bal = (O_bal with its first block row removed).
Output: A_bal, B_bal, C_bal, D_bal.
The model reduction can be done in step 5 of Algorithm 9.1. Let r be the desired order of the reduced model and let X_red be the balanced state sequence X_bal truncated to its first r rows. As a heuristic model reduction procedure, we derive the reduced model parameters by solving the least squares problem

   [ x_red(n_max + 2)  · · ·  x_red(2n_max + 2 + m) ]     [ A_red  B_red ] [ x_red(n_max + 1)  · · ·  x_red(2n_max + 1 + m) ]
   [ y_d(n_max + 1)    · · ·  y_d(2n_max + 1 + m)   ]  =  [ C_red  D_red ] [ u_d(n_max + 1)    · · ·  u_d(2n_max + 1 + m)   ]

in place of the exact system of equations (9.1). The obtained model (A_red, B_red, C_red, D_red) is not the same as the model obtained by truncation of the (finite time-∆) balanced model. In particular, we do not know about error bounds similar to the ones available for the (infinite time) balanced model reduction.

Step 1, computation of the impulse response, is the crucial one. Once H is computed, a balanced model can be obtained directly via Kung's algorithm. This gives the alternative deterministic balanced model identification algorithm, outlined in Algorithm 9.2.
deterministic balanced model identication algorithm, outlined in Algorithm 9.2.
In Algorithm 9.2, once the impulse response is computed, the parameters A
bal
, B
bal
,
C
bal
, and D
bal
are obtained without returning to the original observed data. Yet another
alternative for computing a balanced representation directly from data is to obtain the pa-
rameters A
bal
and C
bal
as in Algorithm 9.2 from O
bal
and the parameters B
bal
and D
bal
(as
well as the initial condition x
bal
(1), under which w
d
is obtained) from the linear system of
equations
y
d
(t) = C
bal
A
t
bal
x
bal
(1) +
t1

=1
C
bal
A
t1
bal
B
bal
u
d
() +D
bal
(t + 1), for t = 1, . . . , T,
(9.2)
using the original data. (By using Kronecker products, (9.2) can be solved explicitly.) The
resulting Algorithm 9.3 is in the spirit of the MOESP-type algorithms.
Simulation results show that in the presence of noise, going back to the data, as
done in Algorithms 9.1 and 9.3, leads to more accurate results. This gives an indication that
Algorithms 9.1 and 9.3 might be superior to Algorithm 9.2. i
i
i
i
144 Chapter 9. Balanced Model Identication
Algorithm 9.3 Balanced identification via an observability matrix uy2h2o2ss
Input: u_d, y_d, n_max, and ∆ > n_max.
1: Find the first 2∆ samples H(0), . . . , H(2∆ − 1) of the impulse response of the MPUM and let H := col( H(0), . . . , H(2∆ − 1) ).
2: Compute the SVD, H = UΣV^T, of the block-Hankel matrix of Markov parameters H = H_∆(H) in R^{∆p × ∆m}.
3: Define O_bal := U√Σ.
4: Let C_bal be equal to the first p rows of O_bal (the first block row) and A_bal be the solution of the shift equation (O_bal with its last block row removed) A_bal = (O_bal with its first block row removed).
5: Solve the system of equations (9.2) for B_bal, D_bal, and x_bal(1).
Output: A_bal, B_bal, C_bal, D_bal.
9.2 Algorithm for Balanced Identification
In Chapter 8, we specified steps 1 and 2 of Algorithm 9.1. Steps 3, 4, and 5 follow from standard derivations, which we now detail. Let H be the Hankel matrix of the Markov parameters H := H_∆(H). By factoring H into O_bal and C_bal via the restricted SVD

   H = UΣV^T = (U√Σ) (√Σ V^T),   O_bal := U√Σ,   C_bal := √Σ V^T,

we obtain an extended observability matrix O_bal = O_∆(A_bal, C_bal) and a corresponding extended controllability matrix C_bal = C_∆(A_bal, B_bal) in a finite time balanced basis. The basis is finite time-∆ balanced, because the finite time-∆ observability gramian O_bal^T O_bal = Σ and the finite time-∆ controllability gramian C_bal C_bal^T = Σ are equal and diagonal.
= are equal and diagonal.
The matrix of sequential zero input responses Y
0
can be written as Y
0
= X for a
certain extended observability matrix and a state sequence X in the same basis. We nd
the balanced state sequence
X
bal
:=
_
x
bal
(l
max
+ 1) x
bal
(l
max
+ 1 + n
max
+ m)

corresponding to O
bal
= U

from
Y
0
= O
bal
X
bal
= X
bal
=

1
U

Y
0
.
The corresponding balanced representation (A
bal
, B
bal
, C
bal
, D
bal
) is computed from the
system of equations
_
x
bal
(l
max
+ 2) x
bal
(l
max
+ 2 + n
max
+ m)
y
d
(l
max
+ 1) y
d
(l
max
+ 1 + n
max
+ m)
_
=
_
A
bal
B
bal
C
bal
D
bal
_ _
x
bal
(l
max
+ 1) x
bal
(l
max
+ 1 + n
max
+ m)
u
d
(l
max
+ 1) u
d
(l
max
+ 1 + n
max
+ m)
_
. (9.3)
This yields Algorithm 9.4.
The preceding presentation in this section and Propositions 8.22 and 8.28 prove the
following main result. i
i
i
i
9.3. Alternative algorithms 145
Algorithm 9.4 Algorithm for balanced subspace identification uy2ssbal
Input: u_d, y_d, n_max, l_max, and either ∆ or a convergence tolerance ε.
1: Apply Algorithm 8.9 with inputs u_d, y_d, n_max, l_max, L, and ∆, in order to compute the sequential zero input responses Y_0.
2: Apply Algorithm 8.7 with inputs u_d, y_d, n_max, l_max, and either ∆ or ε, in order to compute the impulse response H and, if not given, the parameter ∆.
3: Form the Hankel matrix H := H_∆(H) and compute the SVD H = UΣV^T.
4: Compute a balanced state sequence X_bal = (√Σ)^{-1} U^T Y_0.
5: Compute a balanced representation by solving (9.3).
Output: A_bal, B_bal, C_bal, D_bal, and ∆.
Algorithm 9.5 Algorithm of Van Overschee and De Moor uy2ssvd
Input: u_d, y_d, and a parameter i.
Define: col(U_p, U_f) := H_{2i}(u_d), where rowdim(U_p) = i, and col(Y_p, Y_f) := H_{2i}(y_d), where rowdim(Y_p) = i. Compute the weight matrix W := U_p^T (U_p U_p^T)^{-1} J, where J is the left-right flipped identity matrix.
1: Compute the oblique projection Y_0 := Y_f /_{U_f} col(U_p, Y_p); see (8.13).
2: Compute the matrix Ĥ := Y_0 W.
3: Compute the SVD Ĥ = UΣV^T.
4: Compute a balanced state sequence X_bal = (√Σ)^{-1} U^T Y_0.
5: Compute a balanced representation by solving (9.3).
Output: A_bal, B_bal, C_bal, D_bal.
Theorem 9.3. Let w_d = (u_d, y_d) be a trajectory of a controllable LTI system B of order n(B) ≤ n_max and lag l(B) ≤ l_max, and let u_d be persistently exciting of order L + l_max + n_max. Then (A_bal, B_bal, C_bal, D_bal) computed by Algorithm 9.4 is a finite time-∆ balanced representation of B.
9.3 Alternative Algorithms
We outline the algorithms of Van Overschee and De Moor [VD96] and Moonen and Ramos [MR93].

Note 9.4 (Weight matrix W) The weight matrix W is different from the one in [VD96]. In terms of the final result Ĥ, however, it is equivalent. Another difference between Algorithm 9.5 and the deterministic balanced subspace algorithm of [VD96] is that the shifted state sequence appearing on the left-hand side of (9.3) is recomputed in [VD96] by another oblique projection.

In the algorithms of Van Overschee and De Moor and Moonen and Ramos, the parameter i plays the role of the finite time balancing parameter ∆. Note that i is given and the past and the future are taken with equal length i.
Algorithm 9.6 Algorithm of Moonen and Ramos uy2ssmr
Input: u_d, y_d, and a parameter i.
Define: col(U_p, U_f) := H_{2i}(u_d), where rowdim(U_p) = i, and col(Y_p, Y_f) := H_{2i}(y_d), where rowdim(Y_p) = i. Compute a matrix [ T_1  T_2  T_3  T_4 ], whose rows form a basis for the left kernel of col(U_p, Y_p, U_f, Y_f).
1: Compute a matrix of zero input responses Y_0 = T_4^+ [ T_1  T_2 ] col(U_p, Y_p).
2: Compute the Hankel matrix of Markov parameters H = T_4^+ (T_2 T_4^+ T_3 − T_1) J.
3: Compute the SVD, H = UΣV^T.
4: Compute a balanced state sequence X_bal = (√Σ)^{-1} U^T Y_0.
5: Compute a balanced representation by solving (9.3).
Output: A_bal, B_bal, C_bal, D_bal.
Both Algorithm 9.5 and Algorithm 9.6 fit the outline of Algorithm 9.1, but steps 1 and 2 are implemented in rather different ways. As shown in Section 8.8, the oblique projection Y_f /_{U_f} col(U_p, Y_p) is a matrix of sequential zero input responses. The weight matrix W in the algorithm of Van Overschee and De Moor is constructed so that Ĥ = Y_0 W is an approximation of the Hankel matrix of Markov parameters H; it is the sum of H and a matrix of zero input responses.

The most expensive computation in the algorithm of Moonen and Ramos is the computation of the annihilators [ T_1  · · ·  T_4 ]. The matrix [ T_1  T_2 ] col(U_p, Y_p) is a nonminimal state sequence (the shift-and-cut operator [RW97]) and T_4^+ is a corresponding extended observability matrix. Thus T_4^+ [ T_1  T_2 ] col(U_p, Y_p) is a matrix of sequential zero input responses. It turns out that (T_2 T_4^+ T_3 − T_1) J is an extended controllability matrix (in the same basis), so that T_4^+ (T_2 T_4^+ T_3 − T_1) J is the Hankel matrix of Markov parameters H.
A major difference between the proposed Algorithm 9.4, on one hand, and the algorithms of Van Overschee and De Moor and Moonen and Ramos, on the other hand, is that in Algorithm 9.4 the Hankel matrix H is not computed but constructed from the impulse response that parameterizes it. This is a big computational saving because recomputing the same elements of H is avoided. In addition, in approximate identification, where w_d is not a trajectory of B, the matrices Ĥ and H computed by the algorithms of Van Overschee and De Moor and Moonen and Ramos are in general no longer Hankel, while the matrix H in Algorithm 9.4 is by construction Hankel.

9.4 Splitting of the Data into Past and Future
In the algorithms of Moonen and Ramos and Van Overschee and De Moor, the block-Hankel matrices col(U_p, U_f) and col(Y_p, Y_f) are split into past and future of equal length. Natural questions are why this is necessary and, furthermore, which partitionings are optimal according to certain relevant criteria. These questions have been open for a long time, in particular in the
context of the stochastic identification problem; see [DM03].

In Chapter 8, we showed that the past U_p, Y_p is used to assign the initial conditions and the future U_f, Y_f is used to compute a response. By weaving consecutive segments of the response, as done in Algorithms 8.7 and 8.9, the number of block rows in the future does not need to be equal to the required length of the response. Thus from the perspective of deterministic identification, the answer to the above question is as follows:

   rowdim(U_p) = rowdim(Y_p) = l_max, i.e., the given least upper bound on the system lag l(B), and rowdim(U_f) = rowdim(Y_f) in {1, . . . , ν − l_max − n_max}, where ν is the order of persistency of excitation of the input u_d.

By using the iterative algorithms for computation of the impulse response and sequential free responses with parameter L = 1, Algorithms 9.2, 9.3, and 9.4 require the same assumption as the identifiability assumption of Theorem 8.16, so that the partitioning past = l_max and future = 1 is consistent with our previous results.

Using the fundamental lemma, we can prove the following result.
Proposition 9.5. Let w_d = (u_d, y_d) be a trajectory of a controllable LTI system B in L^{w,n_max}_{m,i}, and let u_d be persistently exciting of order 2i + n_max. Then the representations computed by Algorithms 9.5 and 9.6 are equivalent to B. Moreover, the representation computed by Algorithm 9.6 is in a finite time-i balanced basis.

Proposition 9.5 shows that Algorithms 9.5 and 9.6 are not parsimonious with respect to the available data. In particular, the system B can be identifiable with Algorithms 9.2, 9.3, and 9.4 but not with Algorithms 9.5 and 9.6.

Note that the persistency of excitation required by Algorithms 9.5 and 9.6 is a function of the finite time balancing parameter. This implies that with a finite amount of data, Algorithms 9.5 and 9.6 are limited in the ability to identify a balanced representation. In fact,

   i ≤ ⌊ (T + 1) / ( 2 (max(m, p) + 1) ) ⌋,

where ⌊a⌋ denotes the highest integer smaller than a. In contrast, the persistency of excitation required by Algorithms 9.2, 9.3, and 9.4 depends only on the upper bounds on the system order and the lag, and thus these algorithms can compute an infinite time balanced representation if the identifiability condition holds.
9.5 Simulation Examples
In this section, we show examples that illustrate some of the advantages of the proposed Algorithm 9.4. In all experiments the system B is given by a minimal input/state/output representation with transfer function

   C(Iz − A)^{-1} B + D = 0.89172 (z − 0.5193)(z + 0.5595) / ( (z − 0.4314)(z + 0.4987)(z + 0.6154) ).

The input is a unit variance white noise and the data available for identification is the corresponding trajectory w_d of B, corrupted by white noise with standard deviation σ.
Figure 9.1. Impulse response estimation. Solid red line: exact impulse response H; dashed blue line: impulse response Ĥ computed from data via Algorithm 8.7. (The four panels correspond to σ = 0.0, 0.1, 0.2, 0.4, with ‖H − Ĥ‖ = 10^{-15}, 0.02, 0.05, 0.21, respectively.)
Although our main concern is the correct work of the algorithms for exact data, i.e., with σ = 0, by varying the noise variance σ², we can investigate empirically the performance under noise. The simulation time is T = 100. In all experiments the upper bounds n_max and l_max are taken equal to the system order n = 3 and the parameter L is taken equal to 3.

Consider first the estimation of the impulse response. Figure 9.1 shows the exact impulse response H of B and the estimate Ĥ computed by Algorithm 8.7. With exact data, ‖H − Ĥ‖_F = 10^{-15}, so that up to the numerical precision the match is exact. The plots in Figure 9.1 show the deterioration of the estimates when the data is corrupted by noise.

Consider next the computation of the zero input response. Table 9.1 shows the error of estimation e := ‖Y_0 − Ŷ_0‖_F and the corresponding number of operations, where Y_0 is a matrix of exact sequential zero input responses with length ∆ = 10 and Ŷ_0 is its estimate computed from data. The estimate is computed in three ways: by Algorithm 8.9, implemented with the QR factorization; by the oblique projection, computed directly from (8.13); and by the oblique projection, computed via the QR factorization; see Section 8.8.

Algorithm 8.9 needs fewer computations and gives more accurate results than the alternatives. As already emphasized, the reason for this is that selecting the parameter L = n_max = 3 instead of L = ∆ = 10, as in a block computation, results in a more overdetermined system of equations in step 4 of Algorithm 8.9 compared with system (8.9) used in the block algorithm. (For the example, the difference is 95 vs. 88 columns.) As a result, the noise is averaged over more samples, which leads to a better estimate in a
Table 9.1. Error of estimation e = ‖Y_0 − Ŷ_0‖_F and the corresponding number of operations f in megaflops, where Y_0 is an exact sequence of zero input responses and Ŷ_0 is the estimate computed from data.

                    σ = 0.0          σ = 0.1          σ = 0.2          σ = 0.4
Method              e        f       e        f       e        f       e        f
Alg. 8.9 with QR    10^-14   130     1.2990   131     2.5257   132     4.7498   132
formula (8.13)      10^-10   182     1.6497   186     3.2063   187     6.0915   189
(8.13) with QR      10^-14   251     1.6497   251     3.2063   251     6.0915   252
statistical sense. Solving several more overdetermined systems of equations instead of one more rectangular system can be more efficient, as it is in the example.

All algorithms return a finite time balanced model. The next experiment illustrates the effect of the parameter ∆ on the balancing. Let W_c/W_o be the controllability/observability gramians of an infinite time balanced model and Ŵ_c/Ŵ_o be the controllability/observability gramians of an identified model. Define closeness to balancing by

   e_bal² := ( ‖W_c − Ŵ_c‖_F² + ‖W_o − Ŵ_o‖_F² ) / ( ‖W_c‖_F² + ‖W_o‖_F² ).

Figure 9.2 shows e_bal as a function of ∆ for the three algorithms presented in the chapter. The estimates obtained by Algorithm 9.4 and the algorithm of Moonen and Ramos are identical. The estimate obtained by the algorithm of Van Overschee and De Moor is asymptotically equivalent, but for small ∆, it is worse. This is a consequence of the fact that this algorithm uses an approximation of the Hankel matrix of Markov parameters. Figure 9.2 also shows e_bal as a function of ∆ for noisy data with σ = 0.001 and the total number of flops required by the three algorithms.
9.6 Conclusions
The impulse response and sequential free responses are the main tools for balanced model identification. First, a (nonbalanced) state sequence is obtained from the sequential free responses. Then a Hankel matrix is constructed from the impulse response, and from its SVD a balancing transformation is obtained. A balanced state sequence is computed via a change of basis and the corresponding balanced state representation is obtained by solving a system of equations. We called this procedure the basic algorithm and showed that the algorithms of Moonen and Ramos and Van Overschee and De Moor fit into it. Based on the algorithms for computation of the impulse response and sequential free responses directly from data, we proposed alternative algorithms for balanced model identification.

There are a number of advantages of the proposed algorithms over the existing ones. The algorithms of Moonen and Ramos and Van Overschee and De Moor compute the whole Hankel matrix of Markov parameters H, while the proposed algorithms compute only the elements that uniquely specify H and then construct H from them. Because of the Hankel structure, the algorithms of Moonen and Ramos and Van Overschee and De Moor recompute most elements of H many times. This is an inefficient step in these algorithms that we avoid.
Figure 9.2. Closeness to balancing e_bal and computational cost as functions of the finite time balancing parameter ∆ (uy2ssmr: Algorithm 9.6, uy2ssvd: Algorithm 9.5, uy2ssbal: Algorithm 9.4). The left panels show e_bal(∆) for σ = 0 and σ = 0.001; the right panel shows the number of flops for σ = 0.
In the algorithms of Moonen and Ramos and Van Overschee and De Moor, the finite time balancing parameter is supplied by the user. In the proposed algorithms, it can be determined automatically on the basis of a desired convergence tolerance of the impulse response, which is directly related to the closeness of the obtained representation to a balanced one.

The algorithms of Moonen and Ramos and Van Overschee and De Moor compute a finite time-∆ balanced representation with ∆ ≤ (1/2)(T + 1)/(max(m, p) + 1), where T is the length of the given time series w_d. The proposed algorithms have no such limitation and can thus compute a representation that is arbitrarily close to an infinite time balanced one.

The proposed algorithms have a weaker persistency of excitation condition than the one needed for the algorithms of Moonen and Ramos and Van Overschee and De Moor. As a result, in certain cases, the proposed algorithms are applicable, while the algorithms of Moonen and Ramos and Van Overschee and De Moor are not.
Figure 10.1. Block scheme of the dynamic EIV model.
Chapter 10
Errors-in-Variables Smoothing and Filtering

State estimation problems for LTI systems with noisy inputs and outputs (EIV model, see Figure 10.1) are considered. An efficient recursive algorithm for the smoothing problem is derived. The equivalence between the optimal filter and an appropriately modified Kalman filter is established. The optimal estimate of the input signal is derived from the optimal state estimate. The result shows that the EIV filtering problem is not fundamentally different from the classical Kalman filtering problem.

10.1 Introduction
The EIV smoothing and filtering problems were first put forward by Guidorzi, Diversi, and Soverini in [GDS03], where a transfer function approach is used and recursive algorithms that solve the filtering problem are derived. The treatment, however, is limited to the SISO case and the solution obtained is not linked to the classical Kalman filter.

The MIMO case is addressed in [MD03], where the equivalence with a modified Kalman filter is established. Closely related to the approach of [MD03] is the one used in [DGS03]. The continuous-time version of the EIV state estimation problem is explicitly solved in [MWD02] by a completion of squares approach.

In this chapter, we consider the EIV model

   w_d = w̄ + w̃,   where w̄ in B, B in L^{w,n}_m        (EIV)
and w̃ is a white, stationary, zero mean, stochastic process with positive definite covariance matrix V_w̃ := cov( w̃(t) ) for all t; i.e., we assume that the observed time series w_d is a noise corrupted version of a true time series w̄ that is a trajectory of an LTI system B with input cardinality m and state dimension n. The system B and the noise covariance V_w̃ are assumed known.

We use the input/state/output representation of B = B_i/s/o(A, B, C, D), i.e.,

   w̄ = col(ū, ȳ),   where σx̄ = A x̄ + B ū,  ȳ = C x̄ + D ū,  x̄(1) = x̄_ini.        (10.1)

Correspondingly, the observed time series w_d and the measurement errors w̃ have the input/output partitionings w_d = col(u_d, y_d) and w̃ = col(ũ, ỹ). Furthermore, the covariance matrix V_w̃ is assumed to be block-diagonal, V_w̃ = diag(V_ũ, V_ỹ), where V_ũ, V_ỹ > 0.
> 0.
The problem considered is to nd the LS estimate of the state x from the observed
data w
d
. We prove that the optimal lter is the Kalman lter for the system
x = A x +Bu
d
+v
1
,
y
d
= C x +Du
d
+v
2
,
(10.2)
where the process noise v
1
and the measurement noise v
2
are jointly white
cov
__
v
1
(t
1
)
v
2
(t
1
)
_
,
_
v
1
(t
2
)
v
2
(t
2
)
__
=
_
B 0
D I
_ _
V
u
V
y
_ _
B 0
D I
_

(t
1
t
2
)
=:
_
Q S
S

R
_
(t
1
t
2
).
(10.3)
The EIV state estimation problem is treated in [RH95] and [FW97]. The global
total least squares problem of [RH95] has as a subproblem the computation of the closest
trajectory in the behavior of a given system to a given time series. This is a deterministic
approximation problem corresponding to the EIV smoothing problem considered in this
chapter.
10.2 Problem Formulation
Consider the time horizon [1, T] and define the covariance matrices of ũ, ỹ, and w̃:

   V_ũ := cov(ũ),   V_ỹ := cov(ỹ),   and   V_w̃ := cov(w̃).

Problem 10.1 (EIV smoothing problem). The EIV smoothing problem is defined by

   min over x̂ and ŵ = col(û, ŷ) of (w_d − ŵ)^T V_w̃^{-1} (w_d − ŵ)   subject to
   x̂(t + 1) = A x̂(t) + B û(t),  ŷ(t) = C x̂(t) + D û(t),  for t = 1, . . . , T,        (10.4)

and the EIV smoothed state estimate x̂(·, T + 1) is the optimal solution of (10.4).

Under the normality assumption for w̃, x̂(·, T + 1) is the maximum likelihood estimate of x̄ [GDS03].
Problem 10.2 (EIV filtering problem). The EIV filtering problem is to find a dynamical system

   σz = A_f z + B_f w_d,   x̂ = C_f z + D_f w_d        (10.5)

such that x̂(t) = x̂(t, t + 1), where x̂(·) is the solution of (10.5), i.e., the EIV filtered state estimate, and x̂(·, t + 1) is the EIV smoothed state estimate with a time horizon t + 1.

The EIV filtering problem is defined as a state estimation problem. When the input is measured with additive noise, an extra step is needed to find the filtered input/output signals from the state estimate. This is explained in Note 10.8.

Note 10.3 (Initial conditions) Problem 10.1 implicitly assumes no prior information about the initial condition x̄(1). Another possibility is that x̄(1) is exactly known. The standard assumption is x̄(1) ~ N(x̄_ini, P_ini), i.e., x̄(1) is normally distributed with mean x̄_ini and covariance P_ini. An exactly known initial condition corresponds to P_ini = 0 and an unknown initial condition corresponds to information matrix P_ini^{-1} = 0.
We have chosen the initial condition assumption that results in the simplest derivation. With the classical stochastic assumption x̄(1) ~ N(x̄_ini, P_ini), (10.4) becomes

   min over û, ŷ, x̂ of ‖ √( diag(P_ini, V_ũ, V_ỹ)^{-1} ) col( x̂_ini − x̄_ini, u_d − û, y_d − ŷ ) ‖²
   subject to x̂(t + 1) = A x̂(t) + B û(t),  ŷ(t) = C x̂(t) + D û(t),  for t = 1, . . . , T.
10.3 Solution of the Smoothing Problem

Block Algorithms

We represent the input/output dynamics of the system (10.1), over the time horizon [1, T], explicitly as (see (VC))

    ȳ = O_T(A, C) x_ini + T_T(H) ū,

where H is the impulse response of B. Using this representation, the EIV smoothing problem (10.4) becomes a classical weighted least squares problem
\[
\min_{\hat x_{\mathrm{ini}},\,\hat u}\
\left\|
\sqrt{\begin{bmatrix} V_{\tilde u} & \\ & V_{\tilde y} \end{bmatrix}^{-1}}
\left(
\begin{bmatrix} u_d \\ y_d \end{bmatrix}
-
\begin{bmatrix} 0 & I \\ O_T(A, C) & T_T(H) \end{bmatrix}
\begin{bmatrix} \hat x_{\mathrm{ini}} \\ \hat u \end{bmatrix}
\right)
\right\|^{2}.
\tag{10.6}
\]
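To make the block formulation concrete, the following MATLAB sketch forms O_T(A, C) and T_T(H) explicitly and solves (10.6) with the backslash operator. It is a dense, illustrative implementation only; the function name and interface are choices made here and do not correspond to the software of Appendix B.

  % Minimal sketch: EIV smoothing via the weighted least squares problem (10.6).
  % Inputs: system matrices A, B, C, D; noise covariances Vu, Vy;
  %         data ud (T x m) and yd (T x p), one sample per row.
  function [xini_hat, uhat] = eiv_smooth_block(A, B, C, D, Vu, Vy, ud, yd)
    [p, n] = size(C);  m = size(B, 2);  T = size(ud, 1);
    % extended observability matrix O_T(A, C)
    O = zeros(T*p, n);  At = eye(n);
    for t = 1:T
      O((t-1)*p+(1:p), :) = C * At;
      At = A * At;
    end
    % block lower triangular Toeplitz matrix T_T(H) of the Markov parameters
    TT = zeros(T*p, T*m);
    for t = 1:T                                  % block column t
      TT((t-1)*p+(1:p), (t-1)*m+(1:m)) = D;
      At = eye(n);
      for s = t+1:T
        TT((s-1)*p+(1:p), (t-1)*m+(1:m)) = C * At * B;
        At = A * At;
      end
    end
    % stack the data and form the weighted least squares problem (10.6)
    ud_vec = reshape(ud', [], 1);  yd_vec = reshape(yd', [], 1);
    G = [zeros(T*m, n), eye(T*m); O, TT];        % maps [xini; u] to [u; y]
    W = blkdiag(kron(eye(T), sqrtm(inv(Vu))), kron(eye(T), sqrtm(inv(Vy))));
    theta = (W * G) \ (W * [ud_vec; yd_vec]);    % weighted least squares solution
    xini_hat = theta(1:n);
    uhat = reshape(theta(n+1:end), m, T)';       % smoothed input, T x m
  end

The smoothed output can then be recovered as O_T(A, C) x̂_ini + T_T(H) û.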
Alternatively, we represent the input/state/output dynamics of the system, over the time horizon [1, T], as

    𝐲 = 𝐀 x̄ + 𝐁 ū,

where

\[
\mathbf y := \begin{bmatrix} \bar y(1)\\ 0\\ \bar y(2)\\ 0\\ \vdots\\ \bar y(T)\\ 0 \end{bmatrix},\qquad
\mathbf A := \begin{bmatrix}
C & 0 & & \\
A & -I & & \\
& C & 0 & \\
& A & -I & \\
& & \ddots & \ddots \\
& & C & 0 \\
& & A & -I
\end{bmatrix},\qquad
\mathbf B := \begin{bmatrix}
D & & \\
B & & \\
& D & \\
& B & \\
& & \ddots \\
& & D \\
& & B
\end{bmatrix}.
\]
Substituting y_d − ỹ for ȳ and u_d − ũ for ū (see (EIV)), we have

    𝐲_d − 𝐁 u_d = 𝐀 x̄ − 𝐁 ũ + 𝐂 ỹ,    (10.7)

where 𝐲_d is defined analogously to 𝐲 and

\[
\mathbf C := \operatorname{diag}\!\left(\begin{bmatrix} I \\ 0 \end{bmatrix}, \dots, \begin{bmatrix} I \\ 0 \end{bmatrix}\right).
\]

Using (10.7) and defining w̃ := col(ũ, ỹ), (10.4) is equivalent to the problem

\[
\min_{\bar x,\,\tilde w}\ \tilde w^{\top} V_{\tilde w}^{-1} \tilde w
\quad\text{subject to}\quad
\mathbf y_d - \mathbf B u_d = \mathbf A \bar x + \begin{bmatrix} -\mathbf B & \mathbf C \end{bmatrix} \tilde w,
\tag{10.8}
\]

which is a minimum norm-type problem, so that its solution also has a closed form.
Recursive Algorithms

Next, we show a recursive solution for the case when x̄(1) = x_ini is given and D = 0. The more general case, x̄(1) ∼ N(x_ini, P_ini) and D ≠ 0, leads to a similar but heavier result.

Define the value function V_τ : R^n → R, for τ = 1, …, T, as follows: V_τ(z) is the minimum value of (10.4) over t = τ, …, T − 1 with x̂(τ) = z. Then V_1(x_ini) is the optimum value of the EIV smoothing problem. By the dynamic programming principle, we have

\[
V_{\tau}(z) = \min_{\hat u(\tau)}
\left(
\left\| \sqrt{V_{\tilde u}^{-1}}\,\bigl(\hat u(\tau) - u_d(\tau)\bigr) \right\|^{2}
+ \left\| \sqrt{V_{\tilde y}^{-1}}\,\bigl(C z - y_d(\tau)\bigr) \right\|^{2}
+ V_{\tau+1}\bigl(A z + B\hat u(\tau)\bigr)
\right).
\tag{10.9}
\]

The value function V_τ is quadratic for all τ; i.e., there are P(τ) ∈ R^{n×n}, s(τ) ∈ R^n, and v(τ) ∈ R, such that

    V_τ(z) = z⊤ P(τ) z + 2 s⊤(τ) z + v(τ).

This allows us to solve (10.9).
Theorem 10.4 (Recursive smoothing). The solution of the EIV smoothing problem with given x̄(1) = x_ini and D = 0 is

\[
\hat u(t) = -\bigl(B^{\top} P(t+1) B + V_{\tilde u}^{-1}\bigr)^{-1}
\bigl(B^{\top} P(t+1) A\,\hat x(t) + B^{\top} s(t+1) - V_{\tilde u}^{-1} u_d(t)\bigr),
\tag{10.10}
\]

x̂(t + 1) = A x̂(t) + B û(t), with x̂(1) = x_ini, and ŷ(t) = C x̂(t), for t = 0, …, T − 1, where

\[
P(t) = -A^{\top} P(t+1) B \bigl(B^{\top} P(t+1) B + V_{\tilde u}^{-1}\bigr)^{-1} B^{\top} P(t+1) A
+ A^{\top} P(t+1) A + C^{\top} V_{\tilde y}^{-1} C,
\tag{10.11}
\]

for t = T − 1, …, 0, with P(T) = 0, and

\[
s(t) = -A^{\top} P(t+1) B \bigl(B^{\top} P(t+1) B + V_{\tilde u}^{-1}\bigr)^{-1}
\bigl(B^{\top} s(t+1) - V_{\tilde u}^{-1} u_d(t)\bigr)
+ A^{\top} s(t+1) - C^{\top} V_{\tilde y}^{-1} y_d(t),
\tag{10.12}
\]

for t = T − 1, …, 0, with s(T) = 0.
Proof. See Appendix A.4.
P and s are obtained from the backward-in-time recursions (10.11) and (10.12), and the estimates û, x̂, and ŷ are obtained from the forward-in-time recursion (10.10).
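A direct transcription of Theorem 10.4 into MATLAB might look as follows. This is a minimal sketch for D = 0 and known x̄(1) = x_ini; the function name, the indexing convention for storing P and s, and the treatment of û(T) are illustrative choices, not the implementation of Appendix B.

  % Sketch of the recursive EIV smoother (10.10)-(10.12), assuming D = 0 and x(1) = xini.
  % ud is T x m, yd is T x p; Vu and Vy are the input and output noise covariances.
  % Convention: P(:,:,t) and s(:,t) store P(t) and s(t), with P(T) = 0, s(T) = 0.
  function [uhat, xhat, yhat] = eiv_smooth_recursive(A, B, C, Vu, Vy, ud, yd, xini)
    n = size(A, 1);  T = size(ud, 1);
    iVu = inv(Vu);  iVy = inv(Vy);
    P = zeros(n, n, T);  s = zeros(n, T);          % P(T) = 0, s(T) = 0
    for t = T-1:-1:1                               % backward recursions (10.11), (10.12)
      Pn = P(:, :, t+1);  sn = s(:, t+1);
      G  = B' * Pn * B + iVu;
      P(:, :, t) = -A' * Pn * B * (G \ (B' * Pn * A)) + A' * Pn * A + C' * iVy * C;
      s(:, t)    = -A' * Pn * B * (G \ (B' * sn - iVu * ud(t, :)')) ...
                   + A' * sn - C' * iVy * yd(t, :)';
    end
    xhat = zeros(n, T);  xhat(:, 1) = xini;        % forward recursion (10.10)
    uhat = ud;                                     % u(T) is only penalized, so uhat(T,:) = ud(T,:)
    for t = 1:T-1
      Pn = P(:, :, t+1);  sn = s(:, t+1);
      G  = B' * Pn * B + iVu;
      uhat(t, :) = (-(G \ (B' * Pn * A * xhat(:, t) + B' * sn - iVu * ud(t, :)')))';
      xhat(:, t+1) = A * xhat(:, t) + B * uhat(t, :)';
    end
    yhat = (C * xhat)';                            % yhat(t) = C xhat(t), since D = 0
  end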
Note 10.5 (Suboptimal smoothing) With (A, C) observable, (10.11) has a steady state solution P̄ that satisfies the algebraic Riccati equation

\[
\bar P = -A^{\top} \bar P B \bigl(B^{\top} \bar P B + V_{\tilde u}^{-1}\bigr)^{-1} B^{\top} \bar P A
+ A^{\top} \bar P A + C^{\top} V_{\tilde y}^{-1} C.
\tag{10.13}
\]

In a suboptimal solution, the unique positive definite solution P̄₊ of (10.13) can be substituted for P(t) in (10.12) and (10.10). This is motivated by the typically fast convergence of P(t) to P̄₊. Then the smoothing procedure becomes

1. find the positive definite solution P̄₊ of the algebraic Riccati equation (10.13),
2. simulate the LTI system (10.12) with P(t) = P̄₊ for all t,
3. simulate the LTI system (10.10) with P(t) = P̄₊ for all t.
10.4 Solution of the Filtering Problem

Analogously to the derivation of (10.7) in Section 10.3, we now derive an equivalent model to the EIV model representation in the form (10.2). Substitute u_d − ũ for ū and y_d − ỹ for ȳ (see (EIV)) in (10.1):

    σx̄ = A x̄ + B u_d − B ũ,   y_d = C x̄ + D u_d − D ũ + ỹ.

Then define a fake process noise v₁ and measurement noise v₂ by

    v₁ := −B ũ and v₂ := −D ũ + ỹ.

The resulting system

    σx̄ = A x̄ + B u_d + v₁,   y_d = C x̄ + D u_d + v₂    (10.14)
is in the form (10.2), where Q, S, and R are given in (10.3).

The Kalman filter corresponding to the modified system (10.14) with the covariance (10.3) is

    σẑ = A_kf ẑ + B_kf w_d,   x̂ = C_kf ẑ + D_kf w_d,    (10.15)

where

\[
\begin{aligned}
A_{\mathrm{kf}}(t) &= A - K(t) C, \qquad
B_{\mathrm{kf}}(t) = \begin{bmatrix} B - K(t) D & K(t) \end{bmatrix}, \\
C_{\mathrm{kf}}(t) &= I - P(t) C^{\top} \bigl(C P(t) C^{\top} + R\bigr)^{-1} C, \\
D_{\mathrm{kf}}(t) &= P(t) C^{\top} \bigl(C P(t) C^{\top} + R\bigr)^{-1} \begin{bmatrix} -D & I \end{bmatrix}, \\
K(t) &= \bigl(A P(t) C^{\top} + S\bigr) \bigl(C P(t) C^{\top} + R\bigr)^{-1},
\end{aligned}
\tag{10.16}
\]

and

\[
P(t+1) = A P(t) A^{\top}
- \bigl(A P(t) C^{\top} + S\bigr) \bigl(C P(t) C^{\top} + R\bigr)^{-1} \bigl(A P(t) C^{\top} + S\bigr)^{\top} + Q.
\]
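The time-varying modified Kalman filter (10.15)–(10.16) can be coded directly from the formulas. In the sketch below the function name is illustrative, and the initialization z(1), P(1) is left to the user, since it must reflect the assumption made on the initial condition x̄(1).

  % Sketch of the modified Kalman filter (10.15)-(10.16) for the EIV model.
  % Q, S, R are as in (10.3); ud is T x m, yd is T x p; z1, P1 initialize the filter.
  function [xhat, zhat] = eiv_filter_mkf(A, B, C, D, Q, S, R, ud, yd, z1, P1)
    n = size(A, 1);  T = size(ud, 1);
    z = z1;  P = P1;
    xhat = zeros(n, T);  zhat = zeros(n, T);
    for t = 1:T
      zhat(:, t) = z;
      wd = [ud(t, :)'; yd(t, :)'];
      M   = C * P * C' + R;                        % innovation covariance
      K   = (A * P * C' + S) / M;                  % gain K(t)
      Ckf = eye(n) - P * C' / M * C;               % C_kf(t)
      Dkf = P * C' / M * [-D, eye(size(C, 1))];    % D_kf(t)
      xhat(:, t) = Ckf * z + Dkf * wd;             % filtered state estimate
      % time update (10.15): z(t+1) = A_kf(t) z(t) + B_kf(t) w_d(t)
      z = (A - K * C) * z + [B - K * D, K] * wd;
      P = A * P * A' - (A * P * C' + S) / M * (A * P * C' + S)' + Q;
    end
  end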
We call (10.15) the modified Kalman filter. It recursively solves (10.7) (which is equivalent to (10.14)) for the last block entry of the unknown x̄. The solution is in the sense of the WLS problem

\[
\min_{\bar x,\,e}\ e^{\top}
\Bigl(\begin{bmatrix} -\mathbf B & \mathbf C \end{bmatrix} V_{\tilde w}
\begin{bmatrix} -\mathbf B & \mathbf C \end{bmatrix}^{\top}\Bigr)^{-1} e
\quad\text{subject to}\quad
\mathbf y_d - \mathbf B u_d = \mathbf A \bar x + e,
\]

which is an optimization problem equivalent to the EIV smoothing problem (10.8). Therefore, the EIV filtering problem is solved by the modified Kalman filter.
Theorem 10.6. The solution of the EIV filtering problem is A_f = A_kf, B_f = B_kf, C_f = C_kf, and D_f = D_kf, defined in (10.16).
Note 10.7 (Suboptimal filtering) One can replace the time-varying Kalman filter with the (suboptimal) time-invariant filter, obtained by replacing P(t) in (10.15) with the positive definite solution P̄₊ of the algebraic Riccati equation

\[
\bar P = A \bar P A^{\top}
- \bigl(A \bar P C^{\top} + S\bigr) \bigl(C \bar P C^{\top} + R\bigr)^{-1} \bigl(A \bar P C^{\top} + S\bigr)^{\top} + Q.
\]

Equivalently, one can argue that the time-invariant filter is optimal when T goes to infinity.
Note 10.8 (Optimal estimation of the input/output signals) Up to now we were interested in optimal filtering in the sense of state estimation. The optimal estimates of the input and the output, however, can be derived from the modified Kalman filter. The state estimate x̂, the one-step-ahead prediction ẑ(t + 1), and the optimal input estimate û satisfy the equation

    ẑ(t + 1) = A x̂(t) + B û(t).    (10.17)

Then we can find û exactly from x̂ and ẑ(t + 1), obtained from the modified Kalman filter (10.15). In fact, (10.17) and the Kalman filter equations imply that

    û(t) = E(t) ẑ(t) + F(t) w_d(t),    (10.18)

where E(t) := −V_ũ D⊤ (C P(t) C⊤ + R)^{-1} C and

\[
F(t) := \begin{bmatrix}
I - V_{\tilde u} D^{\top} \bigl(C P(t) C^{\top} + R\bigr)^{-1} D
&
V_{\tilde u} D^{\top} \bigl(C P(t) C^{\top} + R\bigr)^{-1}
\end{bmatrix}.
\]

The optimal output estimate is

    ŷ(t) = (C C_kf(t) + D E(t)) ẑ(t) + (C D_kf(t) + D F(t)) w_d(t).    (10.19)

Appending the output equation of the EIV filter (10.5) with (10.18) and (10.19), we have an explicit solution of the EIV filtering problem of [GDS03] as a (modified) Kalman filter.
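The extra output equations (10.18)–(10.19) can be evaluated from the same per-sample filter quantities; the following sketch (illustrative names, same assumptions as the previous sketch) computes the input and output estimates at one time instant.

  % Sketch: per-sample input/output estimates (10.18)-(10.19) from the modified
  % Kalman filter quantities at time t (z = z(t), P = P(t), wd = col(ud(t), yd(t))).
  function [uhat_t, yhat_t] = eiv_io_estimates(C, D, R, Vu, P, z, wd)
    M   = C * P * C' + R;                                       % innovation covariance
    Ckf = eye(size(P, 1)) - P * C' / M * C;                     % C_kf(t) of (10.16)
    Dkf = P * C' / M * [-D, eye(size(C, 1))];                   % D_kf(t) of (10.16)
    E   = -Vu * D' / M * C;                                     % E(t) in (10.18)
    F   = [eye(size(D, 2)) - Vu * D' / M * D, Vu * D' / M];     % F(t) in (10.18)
    uhat_t = E * z + F * wd;                                    % optimal input estimate (10.18)
    yhat_t = (C * Ckf + D * E) * z + (C * Dkf + D * F) * wd;    % optimal output estimate (10.19)
  end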
Note 10.9 (Misfit/latency) More general estimation problems occur when w̄ is generated by the stochastic model (10.2) with a noise covariance matrix

    V_v := cov(col(v₁(t), v₂(t))),

and the data w_d, available for estimation, is generated by the EIV model (EIV). The EIV smoothing and filtering problems can be defined in this case analogously to Problems 10.1 and 10.2, and the results of the chapter can be repeated mutatis mutandis for the new problems. The final result is the equivalence of the EIV filter to the modified Kalman filter (10.15)–(10.16), with the only difference that now

\[
\begin{bmatrix} Q & S \\ S^{\top} & R \end{bmatrix}
= V_v +
\begin{bmatrix} B & 0 \\ D & I \end{bmatrix}
\begin{bmatrix} V_{\tilde u} & \\ & V_{\tilde y} \end{bmatrix}
\begin{bmatrix} B & 0 \\ D & I \end{bmatrix}^{\top}.
\]

The more general setup is attractive because the noises v₁, v₂ have a different interpretation from that of w̃. The former model the latency contribution and the latter models the misfit contribution; see [LD01, MWD02].
10.5 Simulation Examples

We illustrate numerically the results of the chapter. The parameters of the input/state/output representation of B are

\[
A = \begin{bmatrix} 0.6 & -0.45 \\ 1 & 0 \end{bmatrix},\quad
B = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\quad
C = \begin{bmatrix} 0.48429 & 0.45739 \end{bmatrix},\quad\text{and}\quad
D = 0.5381.
\]

The time horizon is T = 100, the initial state is x_ini = 0, the input signal is a normal white noise sequence with unit variance, and V_ũ = V_ỹ = 0.4.

The estimate of the EIV filter is computed directly from the definition; i.e., we solve a sequence of smoothing problems with increasing time horizon. Every smoothing problem is solved as a WLS problem (10.6). The last block entries of the obtained sequence of solutions form the EIV filter state estimate.

We compare the EIV filter estimate with the estimate of the modified Kalman filter (10.15). The state estimate x̂_kf obtained by the modified Kalman filter is, up to numerical errors, equal to the state estimate x̂_f obtained by the EIV filter: ‖x̂_kf − x̂_f‖ < 10^{-14}. This is the desired numerical verification of the theoretical result. The absolute errors of estimation ‖x̄ − x̂‖₂, ‖ū − û‖₂, ‖ȳ − ŷ‖₂ for all estimation methods presented in the chapter are given in Table 10.1.
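A sketch of the simulation setup might look as follows; it reuses the illustrative helper eiv_filter_mkf sketched earlier and omits seeding, plotting, and the smoothing-based computations reported in Table 10.1.

  % Sketch of the simulation example: generate EIV data and run the (illustrative)
  % modified Kalman filter sketched earlier, with exactly known zero initial state.
  A = [0.6 -0.45; 1 0];  B = [1; 0];  C = [0.48429 0.45739];  D = 0.5381;
  T = 100;  Vu = 0.4;  Vy = 0.4;  xini = [0; 0];
  ubar = randn(T, 1);                            % true input: unit variance white noise
  xbar = zeros(2, T);  ybar = zeros(T, 1);  x = xini;
  for t = 1:T
    xbar(:, t) = x;
    ybar(t) = C * x + D * ubar(t);
    x = A * x + B * ubar(t);
  end
  ud = ubar + sqrt(Vu) * randn(T, 1);            % EIV data: noisy input
  yd = ybar + sqrt(Vy) * randn(T, 1);            % EIV data: noisy output
  Q = B * Vu * B';  S = B * Vu * D';  R = D * Vu * D' + Vy;   % covariances (10.3)
  xkf = eiv_filter_mkf(A, B, C, D, Q, S, R, ud, yd, xini, zeros(2));
  err_state = norm(xbar - xkf, 'fro')            % absolute state error, cf. Table 10.1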
Table 10.1. Comparison of the absolute errors of the state, input, and output estimates for all methods (MKF: modified Kalman filter).

  Method                ‖x̄ − x̂‖₂   ‖ū − û‖₂   ‖ȳ − ŷ‖₂
  optimal smoothing      75.3981     29.2195    15.5409
  optimal filtering      75.7711     35.5604    16.4571
  time-varying MKF       75.7711     35.5604    16.4571
  time-invariant MKF     76.1835     35.7687    16.5675
  noisy data            116.3374     42.4711    41.2419
10.6 Conclusions

We considered optimal EIV estimation problems for discrete-time LTI systems. A recursive solution for the smoothing problem is derived. The filtering problem is solved via a modified Kalman filter. The equivalence between the EIV filter and the modified Kalman filter is established algebraically using an explicit state space representation of the system. The optimal estimate of the input is a linear function of the optimal state estimate, so that it is obtained by an extra output equation of the modified Kalman filter. The results are extended to the case when the system is driven by a measured and an unobserved input.
Chapter 11

Approximate System Identification

The following identification problem is considered:

Minimize the 2-norm of the difference between a given time series and an approximating one under the constraint that the approximating time series is a trajectory of an LTI system of a fixed complexity.

The complexity is measured by the input cardinality and the lag. The question leads to the global total least squares problem (TLS_{R(z)}) and alternatively can be viewed as maximum likelihood identification in the EIV setting. Multiple time series and latent variables can be considered in the same setting. Special cases of the problem are autonomous system identification, approximate realization, and finite time optimal ℓ₂ model reduction.

The identification problem is related to the structured total least squares problem (STLS_X), so that it can be solved in practice by the methods developed in Chapter 4 and the software tool presented in Appendix B.2. The proposed system identification method and software implementation are tested on data sets from the database for the identification of systems (DAISY).
11.1 Approximate Modeling Problems

Figure 11.1 shows three approximate modeling problems. On top is the model reduction problem: given an LTI system B̄, find an LTI approximation B̂ of a desired lower complexity. A tractable solution that gives very good results in practice is balanced truncation [Moo81]. We consider finite time ℓ₂ optimal model reduction: the sequence of the first T Markov parameters of B̄ is approximated by the sequence of the corresponding Markov parameters of B̂ in the 2-norm sense.

The identification problem is similar to the model reduction one but starts instead from an observed response w_d. Various data collection models (the down arrows from B̄ to w_d and H in Figure 11.1) are possible. For example, the EIV model is w_d = w̄ + w̃, where w̄ is a trajectory generated by B̄ and w̃ is measurement noise.
[Figure 11.1: a diagram connecting the model reduction, identification, and approximate realization problems via data collection and realization arrows, with the STLS problem at the center.]

Figure 11.1. Different problems aiming at a (low complexity) model B̂ that approximates a given (high complexity) model B̄. The time series w_d is an observed response and H is an observed impulse response.
Of independent interest are the identification problems from a measurement of the impulse response H = H̄ + H̃, which is an approximate realization problem, and the autonomous system identification problem, where the given and the approximating time series are free responses. A classical solution to these problems is Kung's algorithm [Kun78].

The key observation that motivates the application of STLS for system identification and model reduction is that their kernel subproblem is to find a block-Hankel, rank-deficient matrix H(ŵ) approximating a given full-rank matrix H(w_d) with the same structure.

Noniterative methods such as balanced model reduction, subspace identification, and Kung's algorithm solve the kernel approximation problem via the SVD.

For finite matrices, however, the SVD approximation of H(w_d) is unstructured. For this reason the algorithms based on the SVD are suboptimal with respect to an induced norm of the approximation error Δw := w_d − ŵ. The STLS method, on the other hand, preserves the structure and is optimal according to this criterion.
Our purpose is to show how system theoretic problems with misfit optimality criterion are solved as equivalent STLS problems

\[
\hat X = \arg\min_{X} \Bigl( \min_{\hat w}\ \| w_d - \hat w \| \quad\text{subject to}\quad
S(\hat w) \begin{bmatrix} X \\ -I \end{bmatrix} = 0 \Bigr)
\tag{STLS_X}
\]

and subsequently make use of the efficient numerical methods developed for the STLS problem.

The constraint of (STLS_X) enforces the structured matrix S(ŵ) to be rank deficient, with rank at most rowdim(X), and the cost function measures the distance from the given data w_d to its approximation ŵ. The STLS problem aims at optimal structured low-rank approximation of S(w_d) by S(ŵ); cf. Chapter 4.
The STLS method originates from the signal processing and numerical linear algebra communities and is not widely known in the area of systems and control. The classical TLS method is known in the early system identification literature as the Koopmans–Levin method [Lev64]. In this chapter, we show the applicability of the STLS method for system identification. We extend previous results [DR94, LD01] on the application of STLS for SISO system identification to the MIMO case and present numerical results on data sets from DAISY [DM05].
The Global Total Least Squares Problem

Let M be a user-specified model class and let w_d be an observed time series of length T ∈ N. The model class restricts the maximal allowed model complexity. Within M, we aim to find the model B̂ that best fits the data according to the misfit criterion

    B̂ := arg min_{B ∈ M} M(w_d, B), with M(w_d, B) := min_{ŵ ∈ B} ‖w_d − ŵ‖.

The resulting optimization problem is known as the global total least squares (GlTLS) problem [RH95].

The approach of Roorda and Heij [RH95] and Roorda [Roo95] is based on solving the inner minimization problem, the misfit computation, by an isometric state representation of the system and subsequently an alternating least squares or Gauss–Newton-type algorithm for the outer minimization problem. They use a state space representation with driving input. Our approach to solving the GlTLS problem is different. We use a kernel representation of the system and relate the identification problem to the STLS problem (STLS_X).
Link with the Most Powerful Unfalsified Model

In Chapter 8, we introduced the concept of the most powerful unfalsified model (MPUM). A model B is unfalsified by the observation w_d if w_d ∈ B. A model B₁ is more powerful than B₂ if B₁ ⊆ B₂. Thus the concept of the MPUM is to find the most powerful model consistent with the observations: a most reasonable and intuitive identification principle.

In practice, however, the MPUM can be unacceptably complex. For example, in the EIV setting the observation w_d := (w_d(1), …, w_d(T)), w_d(t) ∈ R^w, is perturbed by noise, so that with probability one the MPUM is B_mpum = (R^w)^T; see Note 8.10. Such a model is useless because it imposes no laws.

The GlTLS problem addresses this issue by restricting the model complexity via the constraint B̂ ∈ M, where M is an a priori specified model class. Whenever the MPUM does not belong to M, an approximation is needed. The idea is to

correct the given time series as little as possible, so that the MPUM of the corrected time series belongs to M.

This is a most reasonable adaptation of the MPUM to approximate identification. The measure of closeness is chosen as the 2-norm, which weights equally all variables over all time instants. In a stochastic setting, weighted norms can be used in order to take into account prior knowledge about nonuniform variance among the variables and/or in time.
11.2 Approximate Identification by Structured Total Least Squares

The considered approximate identification problem is defined as follows.

Problem 11.1 (GlTLS). For a given time series w_d ∈ (R^w)^T and a complexity specification (m, l), where m is the maximum number of inputs and l is the maximum lag of the identified system, solve the optimization problem

\[
\hat{\mathcal B} := \arg\min_{\mathcal B \in L^{\mathrm w}_{m,l}}
\Bigl( \min_{\hat w \in \mathcal B} \| w_d - \hat w \| \Bigr).
\tag{GlTLS}
\]

The optimal approximating time series is ŵ*, corresponding to a global minimum point of (GlTLS), and the optimal approximating system is B̂.

The inner minimization problem of (GlTLS), i.e., the misfit M(w_d, B) computation, has the system theoretic meaning of finding the best approximation ŵ* of the given time series w_d that is a trajectory of the (fixed from the outer minimization problem) system B. This is a smoothing problem.

Our goal is to express (GlTLS) as an STLS problem (STLS_X). Therefore, we need to ensure that the constraint S(ŵ)[X; −I] = 0 is equivalent to ŵ ∈ B ∈ L^w_{m,l}. As a byproduct of doing this, we relate the parameter X in the STLS problem formulation to the system B. The equivalence is proven under an assumption that is conjectured to hold generically in the data space (R^w)^T.
Lemma 11.2. Consider a time series w ∈ (R^w)^T and assume (without loss of generality) that there are natural numbers l and p, p ≥ 1, and a matrix R ∈ R^{p×(l+1)w}, such that R H_{l+1}(w) = 0. Define R =: [R_0 R_1 ⋯ R_l], where R_i ∈ R^{p×w}, R(z) := Σ_{i=0}^{l} R_i z^i, and B := ker(R(σ)). Then w ∈ B|_{[1,T]} and, if R_l is of full row rank, B ∈ L^{w,pl}_{m,l}, where m := w − p.

Proof. From the identity

    R H_{l+1}(w) = 0  ⟺  Σ_{τ=0}^{l} R_τ w(t + τ) = 0, for t = 1, …, T − l,

it follows that w ∈ B|_{[1,T]}.

By definition, B is a linear system with lag l(B) ≤ l. The assumption that R_l is of full row rank implies that R(z) is row proper. Then the number of outputs of B is p(B) = p and therefore the number of inputs is m(B) = w − p(B) = m. Let l_i be the degree of the ith equation in R(σ)w = 0. The assumption that R_l is of full row rank implies that l_i = l for all i. Therefore, n(B) = Σ_{i=1}^{p} l_i = pl.
The next lemma states the reverse implication.

Lemma 11.3. Consider a time series w ∈ (R^w)^T and assume (without loss of generality) that there are natural numbers l and m < w and a system B ∈ L^w_{m,l}, such that w ∈ B|_{[1,T]}. Let R(σ)w = 0, where R(z) = Σ_{i=0}^{l} R_i z^i, be a shortest lag kernel representation of B. Then the matrix R := [R_0 R_1 ⋯ R_l] annihilates the Hankel matrix H_{l+1}(w), i.e., R H_{l+1}(w) = 0. If, in addition, n(B) = pl, then R_l is of full row rank.

Proof. From w ∈ B|_{[1,T]}, it follows that R H_{l+1}(w) = 0.

Let l_i be the degree of the ith equation in R(σ)w = 0. We have l_i ≤ l and n(B) = Σ_{i=1}^{p} l_i. The assumption n(B) = pl is possible only if l_i = l for all i. Because R(z) is row proper (by the shortest lag assumption of the kernel representation), the leading row coefficient matrix L has full row rank. But since l_i = l for all i, L = R_l.
We have the following main result.

Theorem 11.4. Let B := ker(R(σ)) ∈ L^w_{m,l}, where R(z) = Σ_{i=0}^{l} R_i z^i is row proper, and define the partitioning

    R_l =: [Q_l  P_l],  with Q_l ∈ R^{p×m} and P_l ∈ R^{p×(w−m)}.

If P_l is nonsingular, then for any w ∈ (R^w)^T,

\[
w \in \mathcal B|_{[1,T]} \iff
\mathcal H^{\top}_{l+1}(w) \begin{bmatrix} X \\ -I \end{bmatrix} = 0,
\quad\text{where } X^{\top} = -P_l^{-1}\begin{bmatrix} R_0 & \cdots & R_{l-1} & Q_l \end{bmatrix}.
\]

Proof. The assumption of the theorem is stronger than the assumptions of Lemmas 11.2 and 11.3 because not only is R_l required to be of full row rank but its submatrix P_l is required to have this property. In the direction of assuming w ∈ B|_{[1,T]}, by Lemma 11.3, it follows that R H_{l+1}(w) = 0. Since P_l is nonsingular, R H_{l+1}(w) = 0 is equivalent to H⊤_{l+1}(w)[X; −I] = 0, with X⊤ := −P_l^{-1}[R_0 ⋯ R_{l−1} Q_l]. In the opposite direction, by Lemma 11.2, B = ker(Σ_{i=0}^{l} R_i σ^i) with [R_0 R_1 ⋯ R_l] := [X⊤ −I]. Therefore, P_l = −I is nonsingular.
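As an illustration of Theorem 11.4, the following MATLAB sketch constructs the block-Hankel matrix H_{l+1}(w) and converts between the STLS parameter X and a kernel parameter R = [R_0 ⋯ R_l]. The helper names are illustrative and are not part of the package of Appendix B; in MATLAB, each function would go in its own file or be a local function.

  % Sketch: block-Hankel matrix and the X <-> R correspondence of Theorem 11.4.
  % w is a T x w_dim time series, one sample per row.
  function H = blkhank(w, L)                      % H_L(w) of size (L*w_dim) x (T-L+1)
    [T, wd] = size(w);
    H = zeros(L * wd, T - L + 1);
    for i = 1:L
      H((i-1)*wd + (1:wd), :) = w(i:T-L+i, :)';
    end
  end

  function R = x2r(X, p)                          % from X to R = [R_0 ... R_l]
    R = [X', -eye(p)];                            % [X' -I] H_{l+1}(w) = (H_{l+1}'(w)[X; -I])' = 0
  end

  function X = r2x(R, p)                          % from R (P_l nonsingular) to X
    Pl = R(:, end-p+1:end);                       % last p x p block of R, i.e., P_l
    Rn = -(Pl \ R);                               % normalize so that the last block is -I
    X  = Rn(:, 1:end-p)';
  end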
Theorem 11.4 states the desired equivalence of the identification problem and the STLS problem under the assumption that the optimal approximating system B̂ admits a kernel representation

    B̂ = ker(Σ_{i=0}^{l} R̂_i σ^i),  R̂_l := [Q̂_l  P̂_l]  with P̂_l ∈ R^{p×p} nonsingular.    (∗)

We conjecture that condition (∗) holds true for almost all w ∈ (R^w)^T. Define the subset Ω of (R^w)^T consisting of all time series w ∈ (R^w)^T for which the identification problem is equivalent to the STLS problem, i.e.,

    Ω := { w_d ∈ (R^w)^T | problem (GlTLS) has a unique global minimizer B̂ that satisfies (∗) }.

Conjecture 11.5. The set Ω is generic in (R^w)^T; i.e., it contains an open subset whose complement has measure zero.

The existence and uniqueness part of the conjecture (see the definition of Ω) is motivated in [HS99, Section 5.1]. The motivation for (∗) being generic is the following one.
The highest possible order of a system in the model class L^w_{m,l} is pl. Then, generically in the data space (R^w)^T, n(B̂) = pl. By Lemma 11.3, n(B̂) = pl implies that in a kernel representation B̂ = ker(Σ_{i=0}^{l} R̂_i σ^i), the matrix R̂_l is of full row rank. But generically in R^{p×w} the matrix P̂_l ∈ R^{p×p}, defined by R̂_l =: [Q̂_l  P̂_l], is nonsingular. Although the motivation for the conjecture is quite obvious, the proof seems to be rather involved.
Properties of the Solution

The following are properties of the smoothing problem:

1. ŵ is orthogonal to the correction Δw := w_d − ŵ, and
2. Δw is generated by an LTI system B^⊥ ∈ L^w_{p,l}.

Since the identification problem has the smoothing problem as an inner minimization problem, the same properties hold in particular for the optimal solution of (GlTLS). These results are stated for the SISO case in [DR94] and then proven for the MIMO case in [RH95, Section VI].

Statistical properties of the identification problem (GlTLS) are studied in the literature. For the stochastic analysis, the EIV model is assumed and the basic results are consistency and asymptotic normality. Consistency in the SISO case is proven in [AY70b]. Consistency in the MIMO case is proven in [HS99] in the framework of the GlTLS problem. A complete statistical theory with practical confidence bounds is presented in [PS01] in the setting of the Markov estimator for semilinear models. Consistency of the STLS estimator for the general structure specification described in Chapter 4 is proven in [KMV05].
Numerical Implementation

A recursive solution of the smoothing problem M(w_d, B) is obtained by dynamic programming in Section 10.3 for the special case D = 0 and exactly known initial conditions. An alternative derivation (for general D and unknown initial conditions) via an isometric state representation is given in [RH95]. Both solutions are derived from a system theoretic point of view. A related problem occurs in the STLS problem formulation. Because of the flexible structure specification, the inner minimization problem in the STLS formulation (STLS_X) is more general than the smoothing problem M(w_d, B), where a block-Hankel structure is fixed. In Chapter 4, a closed form expression is derived for the latter problem and a special structure of the involved matrices is recognized. The structure is then used at the level of the computation by employing numerical algorithms for structured matrices. The resulting computational complexity is linear in the length T of the given time series w_d.

The outer minimization problem min_{B ∈ M} M(w_d, B), however, is a difficult nonconvex optimization problem that requires iterative methods. Two methods are proposed in the framework of the GlTLS problem. In [RH95] an alternating least squares method is used. Its convergence is linear and can be very slow in certain cases. In [Roo95], a Gauss–Newton algorithm is proposed. For the solution of the STLS problem, a Levenberg–Marquardt algorithm is used. The convergence of all these algorithms to the desired global minimum is not guaranteed and depends on the provided initial approximation and the given data.

Software for solving the GlTLS problem is described in Appendix B.4.
11.3 Modifications of the Basic Problem

Input/Output Partitionings

A standard assumption in system identification is that an input/output partitioning of the variables is a priori given. Consider a w × w permutation matrix Π and redefine w as Πw. The first m variables of the redefined time series are assumed inputs and the remaining p variables outputs. With col(u, y) := Πw and [Q̂(z) −P̂(z)] := R̂(z), the kernel representation R̂(σ)w = 0 becomes a left matrix fraction representation Q̂(σ)u = P̂(σ)y. The transfer function of B̂ for the input/output partitioning fixed by Π is Ĝ(z) := P̂^{-1}(z)Q̂(z).

Note that under the assumption n(B̂) = p(B̂)l(B̂), the state space representation

\[
\hat A = \begin{bmatrix}
0 & \cdots & 0 & -\hat P_0 \\
I & & & -\hat P_1 \\
& \ddots & & \vdots \\
& & I & -\hat P_{l-1}
\end{bmatrix},\qquad
\hat B = \begin{bmatrix}
\hat Q_0 - \hat P_0 \hat Q_l \\
\hat Q_1 - \hat P_1 \hat Q_l \\
\vdots \\
\hat Q_{l-1} - \hat P_{l-1} \hat Q_l
\end{bmatrix},\qquad
\hat C = \begin{bmatrix} 0 & \cdots & 0 & I \end{bmatrix},\qquad
\hat D = \hat Q_l
\]

is minimal. Therefore, the transition from P̂ and Q̂ (which is the result obtained from the optimization problem) to an input/state/output representation is trivial and requires extra computations only for the formation of the B̂ matrix.
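A minimal MATLAB sketch of this transition, assuming the normalization P̂_l = I and storing the coefficients as three-dimensional arrays Ph(:,:,i+1) = P̂_i and Qh(:,:,i+1) = Q̂_i, is given below; the function name is an illustrative choice.

  % Sketch: observer-form state-space realization from the left MFD parameters
  % Qh(:,:,i+1) = Q_i (p x m) and Ph(:,:,i+1) = P_i (p x p), i = 0,...,l, with P_l = I.
  function [Ahat, Bhat, Chat, Dhat] = mfd2ss(Ph, Qh)
    [p, m, l1] = size(Qh);  l = l1 - 1;  n = p * l;
    Ahat = [zeros(p, n - p), -Ph(:, :, 1)];       % first block row: [0 ... 0 -P_0]
    for i = 1:l-1                                 % block row i+1: [0 ... I ... 0 -P_i]
      Ahat = [Ahat; zeros(p, (i-1)*p), eye(p), zeros(p, (l-1-i)*p), -Ph(:, :, i+1)];
    end
    Dhat = Qh(:, :, l+1);                         % D = Q_l
    Bhat = zeros(n, m);
    for i = 0:l-1                                 % B block i: Q_i - P_i Q_l
      Bhat(i*p + (1:p), :) = Qh(:, :, i+1) - Ph(:, :, i+1) * Dhat;
    end
    Chat = [zeros(p, n - p), eye(p)];             % C = [0 ... 0 I]
  end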
Conjecture 11.5 implies that generically the optimal approximation B̂ admits an input/output partitioning col(u, y) := Πw with Π = I. Moreover, we conjecture that generically B̂ admits an arbitrary input/output partitioning (i.e., col(u, y) := Πw for any permutation matrix Π).
Exact Variables

Another standard assumption is that the inputs are exact (in the EIV setting, noise-free). Let û and ŷ be the approximating input and output. The assumption that u_d is exact imposes the constraint û = u_d.

More generally, if some variables of w_d are exact, then the corresponding elements in ŵ are fixed. In the STLS problem formulation (STLS_X), the exact elements of w_d can be separated in a block of S(w_d) by permuting the columns of H⊤_{l+1}(w_d). The STLS package described in Appendix B.2 allows specification of exact blocks in S(w_d) that are not modified in the solution S(ŵ). After solving the modified problem, the solution X̂ of the original problem, with exact variables, is obtained by applying the reverse permutation.

With a given input/output partition and exact inputs, the GlTLS problem becomes the classical output error identification problem. Moreover, in the single output case the GlTLS misfit is equivalent to the cost function minimized by the prediction error methods. The following simulation example shows that the GlTLS optimal model is equivalent to the model computed by the pem function from the System Identification Toolbox of MATLAB, when the output error structure is specified.
Example 11.6 (Output error system identification) We use the data set "Hair dryer" from [Lju99], available via DAISY [DM05], and search for an approximate system in the model class L^2_{1,5}. On the one hand, we use the GlTLS method, implemented by the function stlsident (see Appendix B.4), with the specification that the input is exact. On the other hand, we use the function pem, evoked with the following calling sequence, which corresponds to output error identification:

  sys = pem( iddata(y, u), l, ...
             'nk', 0, ...
             'DisturbanceModel', 'None', ...
             'SSParameterization', 'Canonical', ...
             'InitialState', 'Estimate', ...
             'LimitError', 0, ...
             'Tolerance', 1e-5, ...
             'MaxIter', 100 );

The systems identified by stlsident and pem are compared in Table 11.1 in terms of the simulation fit (computed by the function compare from the System Identification Toolbox), the GlTLS misfit, and the computation time. Note that

    compare's simulation fit = 100 (1 − M_oe/‖y‖).    (FIT)

Table 11.1. Comparison of the GlTLS and prediction error methods on a SISO output error identification problem.

  Function     Time, sec   Simulation fit       GlTLS misfit
  pem          5.5         91.31059766532992    2.27902450157299
  stlsident    2.9         91.31059766354954    2.27902450178058
Multiple Time Series

In certain cases, e.g., the approximate realization problem, multiple observed time series w_{d,1}, …, w_{d,N} are given. Assume that all time series are of the same length and define w_d to be the matrix valued time series w_d = [w_{d,1} ⋯ w_{d,N}], so that w_d(t) ∈ R^{w×N}. The only modification needed in the GlTLS solution for this case is to consider a block-Hankel matrix H_{l+1}(w_d) with blocks of size w × N instead of w × 1, as for the case of a single observed time series. The software package described in Appendix B.2 treats such problems.
time series. The software package described in Appendix B.2 treats such problems.
Known Initial Conditions
In the GlTLS problem, no prior knowledge about initial conditions is assumed. Thus the
best tting trajectory w is sought in the whole behavior

B. If the initial conditions are
a priori known, w should be searched only among the trajectories of

B generated by the
specied initial conditions. Typical examples of identication problems with known initial
conditions are approximate realization and identication from step response observations.
In both cases, the initial conditions are a priori known to be zero.
Zero initial conditions can be taken into account in the identication problem by
extending the given time series w
d
with l zero samples. Let w
ext
be the extended data
sequences obtainedinthis way. Inorder toensure that the approximation w
ext
is alsoobtained i
i
i
i
11.4. Special problems 167
under zero initial conditions, the rst l samples of w
ext
should be preserved unmodied
in w
ext
.
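In code, the extension is a one-line operation; a sketch, with w_d stored one sample per row and l the lag, is:

  % Sketch: prepend l zero samples so that zero initial conditions correspond
  % to an exactly fitted (zero) prefix of the extended time series.
  wext = [zeros(l, size(wd, 2)); wd];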
Note 11.7 (Known initial conditions) In the current software implementation of the GlTLS method, the specification that the l leading data samples are exact is not possible. This feature of the identification problem goes beyond the scope of the current STLS solution method and software.
Latent Inputs

The classical system identification framework [Lju99] differs from the one in this chapter in the choice of the optimization criterion and the model class. In [Lju99], an unobserved input is assumed to act on the system that generates the observations and the optimization criterion is defined as the prediction error.

An unobserved input e, of dimension e, called a latent input, can be accommodated in the setting of Section 11.2 by augmenting the model class M = L^w_{m,l} with e extra inputs and the cost function ‖w_d − ŵ‖² with the term ‖ê‖². The resulting identification problem is

\[
\min_{\mathcal B \in L^{\mathrm w+\mathrm e}_{\mathrm m+\mathrm e,\,l}}
\Bigl(
\min_{\hat e,\,\hat w}\
\underbrace{\| w_d - \hat w \|^2}_{\text{misfit}}
+ \underbrace{\| \hat e \|^2}_{\text{latency}}
\quad\text{subject to}\quad
\begin{bmatrix} \hat e \\ \hat w \end{bmatrix} \in \mathcal B
\Bigr).
\tag{M+L}
\]

This problem unifies the misfit and latency descriptions of the uncertainty and is put forward by Lemmerling and De Moor [LD01]. In [LD01], it is claimed that the pure latency identification problem

\[
\min_{\mathcal B \in L^{\mathrm w+\mathrm e}_{\mathrm m+\mathrm e,\,l}}
\Bigl(
\min_{\hat e}\ \| \hat e \|^2
\quad\text{subject to}\quad
\begin{bmatrix} \hat e \\ w_d \end{bmatrix} \in \mathcal B
\Bigr)
\tag{L}
\]

is equivalent to the prediction error approach.

The misfit–latency identification problem (M+L) can easily be reformulated as an equivalent pure misfit identification problem (GlTLS). Let w_aug := col(e, w_d), where e := 0 is an e-dimensional zero time series. Then the misfit minimization problem for the time series w_aug and the model class L^{w+e}_{m+e,l} is equivalent to (M+L). The pure latency identification problem (L) can also be treated in our framework by considering w_d exact (see "Exact Variables" above) and modifying only ê. Note that the latent input amounts to increasing the complexity of the model class, so that a better fit is achieved with a less powerful model.
11.4 Special Problems

In this section we consider three special identification problems in an input/output setting. In the first one the data is an observed impulse response. In the second one the data is an observed free response. In the third one the data is an exact impulse response of a high order system, i.e., a system that is not in the specified model class.
The Approximate Realization Problem

Identification from an exact impulse response is the topic of (partial) realization theory; see Section 8.7. When the given data (impulse response observations) is not exact, an approximation is needed. Kung's algorithm is a well-known solution for this problem. However, Kung's algorithm is suboptimal in terms of the misfit criterion

    M_imp(H_d, B) = ‖H_d − Ĥ‖, where Ĥ is an impulse response of B.

Note that in this special case the misfit computation does not involve optimization because the initial conditions and the input are fixed. The GlTLS problem can be used to find an optimal approximate model in terms of the misfit M_imp(H_d, ·). Next, we consider the following approximate realization problem [DM94]:

Given a matrix valued time series H_d ∈ (R^{p×m})^{T+1} and a natural number l, find a system B̂ ∈ L^w_{m,l}, where w := m + p, whose impulse response Ĥ minimizes the approximation error ‖H_d − Ĥ‖ := sqrt( Σ_{t=0}^{T} ‖H_d(t) − Ĥ(t)‖²_F ).

The approximate realization problem is a special GlTLS problem and can be treated as such. Now, however, the given trajectory is an observed impulse response, so that the input is a pulse and the initial conditions are zero. For this reason the direct approach is inefficient. Moreover, known zero initial conditions cannot be specified in the current software implementation; see Note 11.7. In the rest of this section we describe an indirect solution that exploits the special features of the data and avoids specification of zero initial conditions.
The following statement is a corollary of Theorem 11.4.

Corollary 11.8. Let B := ker(R(σ)) ∈ L^w_{m,l}, where R(z) = Σ_{i=0}^{l} R_i z^i is row proper, and define the partitioning

    R_l =: [Q_l  P_l],  with Q_l ∈ R^{p×m} and P_l ∈ R^{p×p}.

If P_l is nonsingular, then for any H ∈ (R^{p×m})^{T+1},

    H is an impulse response of B  ⟺  H⊤_{l+1}(σH)[X; −I] = 0,

where X⊤ = −P_l^{-1}[P_0 P_1 ⋯ P_{l−1}].
Therefore, under assumption (∗), the approximate realization problem can be solved as an STLS problem with structured data matrix H⊤_{l+1}(σH_d). Next, we show how one can obtain an input/state/output representation of the optimal approximating system B̂ from X̂ and the l approximated Markov parameters Ĥ(1), …, Ĥ(l).

By Corollary 11.8, rank(H_{l+1}(σĤ)) =: n = lp. Let

    H_{l+1}(σĤ) = Γ Δ

be a rank revealing factorization. Since Ĥ is an impulse response of B̂, Γ and Δ must be of the form

    Γ = O_{l+1}(Â, Ĉ),  Δ = C_{T−l}(Â, B̂),
where B̂ = B_i/s/o(Â, B̂, Ĉ, D̂). (The basis of the representation is fixed by the rank revealing factorization.) We have

    H⊤_{l+1}(σĤ)[X̂; −I] = 0  ⟹  (ΓΔ)⊤[X̂; −I] = 0  ⟹  [X̂⊤ −I] Γ Δ = 0,

so that col span(Γ) ⊆ ker([X̂⊤ −I]). But dim(col span(Γ)) = n. On the other hand,

    dim( ker([X̂⊤ −I]) ) = (l + 1)p − p = n,

so that col span(Γ) = ker([X̂⊤ −I]). Therefore, a basis for the null space of [X̂⊤ −I] defines an observability matrix of B̂, from which Ĉ and Â can be obtained up to a similarity transformation. D̂ = H_d(0) and B̂ is the unique solution of the system

    O_l(Â, Ĉ) B̂ = col(Ĥ(1), …, Ĥ(l)).
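A MATLAB sketch of this construction (null space of [X̂⊤ −I], shift invariance for Â and Ĉ, and a least squares solve for B̂) is given below; the function name and the use of MATLAB's null are illustrative choices.

  % Sketch: input/state/output parameters from the STLS solution Xh and the
  % approximated Markov parameters Hh(:,:,t) = Hhat(t), t = 1,...,l, plus Hd0 = H_d(0).
  function [Ahat, Bhat, Chat, Dhat] = x2ss(Xh, Hh, Hd0)
    [p, m, l] = size(Hh);  n = l * p;
    Gamma = null([Xh', -eye(p)]);                 % basis of ker([Xh' -I]); assumed of dim n
    O1 = Gamma(1:l*p, :);                         % observability matrix without last block row
    O2 = Gamma(p+1:end, :);                       % observability matrix without first block row
    Ahat = O1 \ O2;                               % shift invariance: O1 * Ahat = O2
    Chat = Gamma(1:p, :);                         % first block row
    Dhat = Hd0;                                   % D = H_d(0)
    % B from O_l(Ahat, Chat) * Bhat = col(Hhat(1), ..., Hhat(l))
    Ol = zeros(l*p, n);  rhs = zeros(l*p, m);  At = eye(n);
    for t = 1:l
      Ol((t-1)*p + (1:p), :) = Chat * At;  At = Ahat * At;
      rhs((t-1)*p + (1:p), :) = Hh(:, :, t);
    end
    Bhat = Ol \ rhs;
  end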
Example 11.9 (Approximate realization) Consider a simulation example in the EIV setup. The data H_d = H̄ + H̃ is a noise corrupted impulse response H̄ of an LTI system B̄. The time horizon is T = 50 and the additive noise standard deviation is 0.25. The true system B̄ is a random stable system (obtained via the MATLAB function drss) with m = 2 inputs, p = 2 outputs, and lag l = 2. The approximate model B̂ is sought in the model class L^{m+p}_{m,l}.

We apply a direct identification from input/output data (the impulse response is extended with l zeros) and the indirect procedure described above. In the two cases, the optimization algorithm converges in 1.13 sec. and 0.63 sec., respectively, which shows the better efficiency of the indirect algorithm. The relative estimation errors ‖H̄ − Ĥ‖/‖H̄‖ in the two cases are 0.2716 and 0.2608, respectively. (The difference is due to the wrong treatment of the initial conditions in the direct method.) For comparison, the relative error with respect to the data H_d is 0.9219. Figure 11.2 shows the fitting of the impulse response of the true system B̄ by the impulse responses of the approximating systems.
Identification of an Autonomous System

The autonomous system identification problem is defined as follows:

Given a time series y_d ∈ (R^p)^T and a natural number l, find a system B̂ ∈ L^p_{0,l} and a vector x̂_ini ∈ R^{n(B̂)}, such that the free response ŷ of B̂ obtained under initial condition x̂_ini minimizes the approximation error ‖y_d − ŷ‖.

This problem is a special case of the approximate realization problem. The shifted impulse response σH of the system B_i/s/o(A, x_ini, C, ·) is equal to the free response of the system B_i/s/o(A, ·, C, ·), obtained under initial condition x_ini. Thus the identification problem for an autonomous system can be solved as an approximate realization problem with the obvious substitution. It is easy to generalize the autonomous system identification problem for multiple time series y_{d,1}, …, y_{d,N}; see Note 8.27.
[Figure 11.2: four panels showing the entries H_11, H_12, H_21, H_22 of the impulse responses over t = 0, …, 20, with legend entries "true", "data", "appr. 1", "appr. 2".]

Figure 11.2. Identification from impulse response. Solid line: exact impulse response H̄; dotted line: data H_d; dashed line: approximating impulse response Ĥ from the direct approach; dashed-dotted line: approximating impulse response Ĥ from the indirect approach.
Example 11.10 (Identification of an autonomous system) Consider the same simulation setup as in Example 11.9, with the only difference being that the true data ȳ is a free response of length T = 20, obtained under a random initial condition. The relative error of approximation ‖ȳ − ŷ‖/‖ȳ‖ is 0.4184, versus 0.7269 for the given data y_d. Figure 11.3 shows the fitting of the free response of the true system B̄ by the approximating free response ŷ of B̂.
Finite Time ℓ₂ Model Reduction

The finite time T, ℓ₂ norm of a system B ∈ L^w_{m,l} with an impulse response H is defined as

    ‖B‖_{ℓ₂,T} = ‖H|_{[0,T]}‖ = sqrt( Σ_{t=0}^{T} ‖H(t)‖²_F ).

For a strictly stable system B, ‖B‖_{ℓ₂,∞} is well defined and is equal to its H₂ norm.
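For reference, the finite time ℓ₂ norm of an error system can be computed from the impulse responses of the two systems with a few lines of MATLAB; the sketch below assumes discrete-time models with unit sampling time and the Control System Toolbox function impulse, and the helper name is illustrative.

  % Sketch: finite time-T l2 norm of the error system B1 - B2, from impulse responses.
  function nrm = l2norm_T(sys1, sys2, T)
    H1 = impulse(sys1, 0:T);                      % (T+1) x p x m response arrays
    H2 = impulse(sys2, 0:T);
    nrm = sqrt(sum((H1(:) - H2(:)).^2));          % sqrt of the sum of squared entries
  end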
Assume that the given time series H_d in the approximate realization problem is the exact impulse response of a higher order system B̄. Such an assumption can be made without loss of generality because any finite time series H_d ∈ (R^{p×m})^{T+1} can be considered as an impulse response of a system in the model class L^w_{m,T}. Then the approximate realization problem can be interpreted as the following finite time ℓ₂ model reduction problem:
[Figure 11.3: two panels showing the free response components y₁ and y₂ over t = 0, …, 20, with legend entries "true", "data", "appr.".]

Figure 11.3. Output-only identification. Solid line: exact trajectory ȳ; dotted line: data y_d; dashed line: approximating trajectory ŷ.
Given a system B̄ ∈ L^w_{m,l}, a natural number l_red < l, and a time horizon T, find a system B̂ ∈ L^w_{m,l_red} that minimizes the finite time T, ℓ₂ norm ‖B̄ − B̂‖_{ℓ₂,T} of the error system.

In the model reduction problem, the misfit is due to the low order approximation. In the approximate realization problem, assuming that the data is obtained from the EIV model, the misfit is due to the measurement errors H̃. The solution methods, however, are equivalent, so in this section we gave an alternative interpretation of the approximate realization problem.
Example 11.11 (Finite time ℓ₂ model reduction) The high order system B̄ is a random stable system (obtained via the MATLAB drss function) with m = 2 inputs, p = 2 outputs, and lag l = 10. A reduced order model B̂ with lag l_red = 1 is sought. The time horizon T is chosen large enough for a sufficient decay of the impulse response of B̄.

Figure 11.4 shows the fitting of the impulse response of the high order system B̄ by the impulse response of the reduced order system B̂.
11.5 Performance on Real-Life Data Sets

The database for system identification (DAISY) [DM05] is used for verification and comparison of identification algorithms. In this section, we apply the GlTLS method, described in this chapter and implemented by the software presented in Appendix B.4, on data sets from DAISY. First, we solve output error identification problems, and then we consider the data set "Step response of a fractional distillation column", which consists of multiple vector time series.

Single Time Series Data Sets

The considered data sets are listed in Table 11.2. Since all data sets come with a given input/output partitioning, the only user-defined parameter that selects the complexity of the model class M = L^{m+p}_{m,l} is the lag l.
[Figure 11.4: four panels showing the entries H_11, H_12, H_21, H_22 of the impulse responses over t = 0, …, 70, with legend entries "given" and "appr.".]

Figure 11.4. Finite time ℓ₂ model reduction. Solid line: impulse response of the given (high-order) system; dashed line: impulse response of the reduced order system.
The data is detrended and split into identification and validation data sets. The first 70% of the data, denoted by w_id, is used for identification, and the remaining 30%, denoted by w_val, is used for validation.

Approximate models are computed via the following methods:

n4sid: the N4SID method implemented in the System Identification Toolbox of MATLAB;

stlsident: the GlTLS method implemented by the STLS solver; and

pem: the prediction error method of the System Identification Toolbox of MATLAB.
Table 11.2. Examples from DAISY. T: time horizon; m: number of inputs; p: number of outputs; l: lag.

  #  Data set name          T     m  p  l
  1  Heating system         801   1  1  2
  2  Hair dryer             1000  1  1  5
  3  Flexible robot arm     1024  1  1  4
  4  Heat flow density      1680  2  1  2
  5  Steam heat exchanger   4000  1  1  2
The inputs are assumed exact, so that identification in the output error setting is considered. The validation is performed in terms of the misfit M_oe(w_val, B̂) obtained on the validation data set and the equivalent (see (FIT)) simulation fit computed by the function compare.

Note 11.12 (About the usage of the methods) The pem function is called with the option

  'DisturbanceModel', 'None',

which specifies an output error model structure. In addition, the options

  'nk', 0, 'LimitError', 0,

and an algorithm option are used to disable, respectively, the default zero feedthrough term of pem, the robustification of the cost function, and the stability constraint. (The GlTLS method does not constrain the model class by enforcing stability.)

With these options (for the single-output case), pem minimizes the output error misfit M_oe. The stlsident function is called with the specification that the inputs are exact, so that the GlTLS and prediction error methods solve equivalent identification problems. For both functions, we set the same convergence tolerance ('Tolerance', 1e-10), maximum number of iterations ('MaxIter', 100), and initial approximation (the model obtained by n4sid).
The systems identified by n4sid, stlsident, and pem are compared in Table 11.3. In all examples there is a good match between the models obtained with the stlsident and pem functions. In addition, the output error optimal model outperforms the model computed by the N4SID method. Since the criterion is checked on a part of the data that is not used for identification, there is no a priori guarantee that the output error method will outperform the N4SID method.
Identification from Step Response Measurements

Next, we consider the data set "Step response of a fractional distillation column" from DAISY. It consists of three independent time series, each one with T = 250 data points. The given data has a fixed input/output partitioning with m = 3 inputs and p = 2 outputs, so that an approximate model is sought in the model class L^5_2. We further bound the complexity of the model class by choosing the lag l = 2, so that the considered model class is L^5_{2,2}.

The step response data is special because it consists of multiple time series, the inputs are exactly known, and the initial conditions are also exactly known. In order to take into account the known zero initial conditions, we precede the given time series with l zero samples. In order to take into account the exactly known inputs, we use the modification of the GlTLS method for time series with exact variables. Multiple time series are processed as explained in Section 11.3.

Figure 11.5 shows the data y (the measured step response) and the step response ŷ of the optimal approximating system, computed by the GlTLS method.
Table 11.3. Comparison of the models obtained by n4sid, stlsident, and pem.

  #  Data set name          Function    Fit %     Misfit
  1  Heating system         n4sid       51.9971   140.8018
                            stlsident   76.0491    70.2527
                            pem         76.0491    70.2527
  2  Hair dryer             n4sid       88.3265     1.5219
                            stlsident   90.8722     1.1900
                            pem         90.8772     1.1893
  3  Flexible robot arm     n4sid       29.5496     3.2480
                            stlsident   96.5454     0.1593
                            pem         96.5454     0.1593
  4  Heat flow density      n4sid       40.7249    11.2233
                            stlsident   83.8574     3.0565
                            pem         83.8574     3.0565
  5  Steam heat exchanger   n4sid       29.6890    25.5047
                            stlsident   60.4452    14.3481
                            pem         60.1575    14.4525
11.6 Conclusions

We generalized previous results on the application of STLS for system identification, approximate realization, and model reduction to multivariable systems. The STLS method allows us to treat identification problems without an input/output partitioning of the variables and EIV identification problems. Multiple time series, latent variables, and prior knowledge about exact variables can be taken into account.

The classical identification problem, where the uncertainty is attributed solely to unobserved inputs and the observed inputs are assumed exact, is a special case of the proposed method. The relation and comparison with classical identification methods, however, have not yet been investigated.

The software tool for solving STLS problems, presented in Appendix B.2, makes the proposed identification method practically applicable. The performance of the software package was tested on data sets from DAISY. The results show that examples with a few thousand data points can be solved routinely and that the optimization method is robust with respect to an initial approximation obtained from a nonoptimization based method.
[Figure 11.5: six panels showing the measured and approximated step responses y_11, y_12, y_21, y_22, y_31, y_32 over t = 50, …, 250.]

Figure 11.5. Identification from step response measurements. Solid line: given data y; dashed line: GlTLS approximation ŷ. (y_ij is the step response from input i to output j.)
Chapter 12

Conclusions

We have promoted a framework for mathematical modeling in which

1. models are disentangled from their representations and
2. data–model discrepancy is explained by correction of the data.

A basic question in our treatment is,

When does a model in a considered model class fit the data exactly and how can we construct such a model?

This exact modeling problem leads to the notion of the most powerful unfalsified model and to identifiability conditions, i.e., under what conditions the data generating model can be recovered from the data. In the generic case when an exact fit is not possible, we propose an approximate model based on misfit minimization.

The misfit approach corrects the data as little as necessary, so that the most powerful unfalsified model for the corrected data belongs to the model class. The approximate model is falsified whenever the data is not generated by a model in the model class, and the misfit is a quantitative measure of how much the model is falsified by the data.

In the errors-in-variables setting, the misfit can be chosen to be the negative log likelihood function. Such an interpretation is appealing and leads to a consistent estimator for the true model. It requires, however, strong assumptions about the data that are rarely verifiable in practice. For this reason, the approximation point of view of the modeling problem is often more appropriate than the stochastic estimation point of view.
Static approximation problems In the simplest special case of a linear static model class and an unweighted misfit function, the misfit minimization problem is the classical total least squares problem. The abstract formulation is transformed to parameter optimization problems by choosing concrete representations of the model. The commonly used representations are kernel, image, and input/output.

Although concrete representations are indispensable for the actual solution of the modeling problems, their usage in problem formulations is not natural.
The abstract, representation-free formulation shows more directly what the aim of the problem is and leaves the choice of the model representation open for the solution.

The classical total least squares problem is generalized in two directions: weighted misfit function and structured data matrix. Defining the problem abstractly and then translating it to concrete parameter optimization problems, we showed links among various, seemingly unrelated, algorithms from the literature. We presented alternative algorithms that in one respect or another outperform the existing methods. However, it is a topic of future work to develop algorithms that combine all the virtues of the existing algorithms.

We presented a new flexible formulation of the structured total least squares problem that is general enough to cover various nontrivial applications and at the same time allows efficient solution methods. Algorithms with linear computational complexity in the number of data points were outlined and implemented in a software package.

Bilinear and quadratic approximation problems are solved by the adjusted least squares method, which has an analytic solution in terms of an eigenvalue decomposition. The adjusted least squares method is a stochastic, latency oriented method, so in these problems we detour from the main line of the book: deterministic misfit approximation. The reason is that the adjusted least squares method leads to a significant computational simplification. In addition, although the theory of the adjusted least squares estimation is asymptotic, simulation results show that the solution is very effective even for small sample size problems.
Dynamic approximation problems In the second part of the book, we considered exact and approximate identification problems for finite time series. We gave a sharp sufficient identifiability condition: if the data generating system is controllable and an input component of the time series is persistently exciting, the most powerful unfalsified model of the data coincides with the data generating system. Exact identification algorithms find the data generating system by computing a representation of the most powerful unfalsified model.

We proposed new algorithms for exact identification of a kernel, convolution, and input/state/output representation of the most powerful unfalsified model. The latter are closely related to the deterministic subspace algorithms. However, the algorithms proposed in the book are more efficient and impose weaker assumptions on the given data. In addition, we gave a system theoretic interpretation of the oblique and orthogonal projections that are used in deterministic subspace identification.

For rough data, the exact identification problem generically identifies a trivial system that explains every time series. When a bound on the complexity of the identified system is imposed, e.g., via a bound on the model class complexity (number of inputs and lags), the exact identification problem generically has no solution. The misfit approximate modeling problem for the linear time-invariant model class is called the global total least squares problem. It is the dynamic equivalent of the classical total least squares problem. We solved the global total least squares problem by relating it to a structured total least squares problem with a block-Hankel data matrix.

Approximate identification is classically considered in a stochastic setting with the latency approach. This classical stochastic model (a system with unobserved noise input), however, like the errors-in-variables model, imposes unverifiable assumptions on the data. In addition, the stochastic framework addresses the approximation issue very indirectly.
Appendix A

Proofs

A.1 Weighted Total Least Squares Cost Function Gradient

Denote by Di the differential operator. It acts on a differentiable function M_wtls : U → R, where U is an open set in R^{m×p}, and gives as a result another function, the differential of M_wtls, Di(M_wtls) : U × R^{m×p} → R. Di(M_wtls) is linear in its second argument, i.e.,

    Di(M_wtls) := d M_wtls(X, H) = trace( M'_wtls(X) H⊤ ),    (A.1)

where M'_wtls : U → R^{m×p} is the derivative of M_wtls, and has the property

    M_wtls(X + H) = M_wtls(X) + d M_wtls(X, H) + o(‖H‖_F),    (A.2)

for all X ∈ U and for all H ∈ R^{m×p}. The notation o(‖H‖_F) has the usual meaning:

    g(H) = o(‖H‖_F)  :⟺  g(H)/‖H‖_F → 0 as ‖H‖_F → 0.
wtls
(X) =
N

i=1
e

i
(X)
1
i
(X)e
i
(X), where
i
(X) :=
_
X

W
1
i
_
X
I
_
.
We nd the derivative M

wtls
(X) by rst deriving the differential Di(M
wtls
) and then repre-
senting it in the form (A.1), from which M

wtls
(X) is extracted. The differential of M
wtls
is
d M
wtls
(X, H)
=
N

i=1
_
a

i
H
1
i
(X)e
i
(X) +e

i
(X)
1
i
(X)H

a
i
+e

i
(X) Di
_

1
i
(X)
_
e
i
(X)
_
=
N

i=1
_
2 trace
_
a
i
e

i
(X)
1
i
(X)H

_
+ trace
_
Di
_

1
i
(X)
_
e
i
(X)e

i
(X)
_
_
.
Using the rule for differentiation of an inverse matrix valued function, we have

    Di(Γ_i^{-1}(X)) = −Γ_i^{-1}(X) Di(Γ_i(X)) Γ_i^{-1}(X).

Using the defining property (A.2), we have

\[
\mathrm{Di}\bigl(\Gamma_i(X)\bigr)
= \mathrm{Di}\Bigl( \begin{bmatrix} X^{\top} & -I \end{bmatrix} W_i^{-1} \begin{bmatrix} X \\ -I \end{bmatrix} \Bigr)
= \begin{bmatrix} H^{\top} & 0 \end{bmatrix} W_i^{-1} \begin{bmatrix} X \\ -I \end{bmatrix}
+ \begin{bmatrix} X^{\top} & -I \end{bmatrix} W_i^{-1} \begin{bmatrix} H \\ 0 \end{bmatrix}.
\]

Let V_i := W_i^{-1} and define the partitioning

\[
V_i =: \begin{bmatrix} V_{a,i} & V_{ab,i} \\ V_{ba,i} & V_{b,i} \end{bmatrix},
\quad\text{with } V_{a,i} \text{ of size } m\times m \text{ and } V_{b,i} \text{ of size } p\times p.
\]

Then

    Di(Γ_i(X)) = H⊤(V_{a,i}X − V_{ab,i}) + (V_{a,i}X − V_{ab,i})⊤H.

Substituting backwards, we have

\[
\begin{aligned}
d M_{\mathrm{wtls}}(X, H)
&= \sum_{i=1}^{N}\Bigl( 2\,\mathrm{trace}\bigl(a_i e_i^{\top}(X) \Gamma_i^{-1}(X) H^{\top}\bigr)
- 2\,\mathrm{trace}\bigl(\Gamma_i^{-1}(X) H^{\top} (V_{a,i}X - V_{ab,i}) \Gamma_i^{-1}(X) e_i(X) e_i^{\top}(X)\bigr) \Bigr) \\
&= \mathrm{trace}\Bigl( \Bigl[ 2 \sum_{i=1}^{N}\bigl( a_i e_i^{\top}(X) \Gamma_i^{-1}(X)
- (V_{a,i}X - V_{ab,i}) \Gamma_i^{-1}(X) e_i(X) e_i^{\top}(X) \Gamma_i^{-1}(X) \bigr) \Bigr] H^{\top} \Bigr).
\end{aligned}
\]

Thus

\[
M'_{\mathrm{wtls}}(X) = 2 \sum_{i=1}^{N}\Bigl( a_i e_i^{\top}(X) \Gamma_i^{-1}(X)
- (V_{a,i}X - V_{ab,i}) \Gamma_i^{-1}(X) e_i(X) e_i^{\top}(X) \Gamma_i^{-1}(X) \Bigr).
\]
A.2 Structured Total Least Squares Cost Function Gradient

The differential Di(f_0) is

    Di(f_0) := d f_0(X, H) = trace( f'_0(X) H⊤ )    (A.3)

and has the property

    f_0(X + H) = f_0(X) + d f_0(X, H) + o(‖H‖_F)

for all X ∈ U and for all H ∈ R^{n×d}. The function f'_0 : U → R^{n×d} is the derivative of f_0. As in Appendix A.1, we compute it by deriving the differential Di(f_0) and representing it in the form (A.3), from which f'_0(X) is extracted.

The differential of the cost function f_0(X) = r⊤(X) Γ^{-1}(X) r(X) is (using the rule for differentiation of an inverse matrix)

\[
d f_0(X, H) = 2\, r^{\top} \Gamma^{-1}
\begin{bmatrix} H^{\top} a_1 \\ \vdots \\ H^{\top} a_m \end{bmatrix}
- r^{\top} \Gamma^{-1}\, d\Gamma(X, H)\, \Gamma^{-1} r.
\]

The differential of the weight matrix

\[
\Gamma = V_{\tilde r} = \mathrm E\, \tilde r \tilde r^{\top},
\quad\text{where}\quad
\tilde r = \begin{bmatrix} X^{\top}\tilde a_1 - \tilde b_1 \\ \vdots \\ X^{\top}\tilde a_m - \tilde b_m \end{bmatrix},
\quad
\tilde A^{\top} =: \begin{bmatrix} \tilde a_1 & \cdots & \tilde a_m \end{bmatrix},\ \tilde a_i \in \mathbb R^{n},
\quad
\tilde B^{\top} =: \begin{bmatrix} \tilde b_1 & \cdots & \tilde b_m \end{bmatrix},\ \tilde b_i \in \mathbb R^{d},
\]

is

\[
d\Gamma(X, H) = \mathrm E
\begin{bmatrix} H^{\top}\tilde a_1 \\ \vdots \\ H^{\top}\tilde a_m \end{bmatrix}
\tilde r^{\top}
+ \mathrm E\, \tilde r
\begin{bmatrix} \tilde a_1^{\top} H & \cdots & \tilde a_m^{\top} H \end{bmatrix}.
\tag{A.4}
\]

With M_{ij} ∈ R^{d×d} denoting the (i, j)th block of Γ^{-1},

\[
\begin{aligned}
d f_0(X, H)
&= 2\Bigl( \sum_{i,j=1}^{m} r_i^{\top} M_{ij} H^{\top} a_j
- \sum_{i,j,l=1}^{m} r_l^{\top} M_{li} H^{\top} \mathrm E\bigl(\tilde a_i \tilde c_j^{\top}\bigr) X_{\mathrm{ext}} M_{jl} r_l \Bigr) \\
&= 2\,\mathrm{trace}\Bigl( \Bigl( \sum_{i,j=1}^{m} a_j r_i^{\top} M_{ij}
- \sum_{i,j=1}^{m} \begin{bmatrix} I & 0 \end{bmatrix} V_{c,ij} X_{\mathrm{ext}} N_{ji} \Bigr) H^{\top} \Bigr),
\end{aligned}
\]

so that

\[
f'_0(X) = 2\Bigl( \sum_{i,j=1}^{m} a_j r_i^{\top} M_{ij}
- \sum_{i,j=1}^{m} \begin{bmatrix} I & 0 \end{bmatrix} V_{c,ij} X_{\mathrm{ext}} N_{ji} \Bigr),
\quad\text{where}\quad
N_{ji}(X) := \Bigl(\sum_{l=1}^{m} M_{jl} r_l\Bigr)\Bigl(\sum_{l=1}^{m} r_l^{\top} M_{li}\Bigr).
\]
A.3 Fundamental Lemma

Of course, N_l(B) ⊆ ker( H_l^⊤(w̃) ). Assume by contradiction that ker( H_l^⊤(w̃) ) ≠ N_l(B). Then there is a lowest degree polynomial r ∈ R^w[z], r(z) =: r_0 + r_1 z + ⋯ + r_{l−1} z^{l−1}, that annihilates H_l(w̃), i.e.,

  col^⊤(r_0, r_1, …, r_{l−1}) H_l(w̃) = 0,

but is not an element of N_l(B).
Consider H_{l+n}(w̃). Then

  ker( H_{l+n}^⊤(w̃) ) ⊇ image( r^(1)(z), z r^(1)(z), …, z^{l+n−l_1−1} r^(1)(z); …;
                               r^(p)(z), z r^(p)(z), …, z^{l+n−l_p−1} r^(p)(z); r(z), z r(z), …, z^n r(z) ),

where l_i denotes the degree of the generator r^(i). Note that r(z), z r(z), …, z^n r(z) are additional elements due to the extra annihilator r. If all these polynomial vectors were linearly independent over R, then the dimension of ker( H_{l+n}^⊤(w̃) ) would be (at least) p(l + n) + 1. But the persistency of excitation assumption implies that the number of linearly independent rows of H_{l+n}(w̃) is at least m(l + n), so that

  dim( ker( H_{l+n}^⊤(w̃) ) ) ≤ p(l + n).

Therefore, not all of these elements are linearly independent. By Lemma 7.5 and the assumption that R is row proper, the generators r^(1), …, r^(p) and all their shifts are linearly independent. It follows that there is a k, 1 ≤ k ≤ n, such that

  z^k r(z) ∈ image( r^(1)(z), z r^(1)(z), …, z^{l+n−l_1−1} r^(1)(z); …;
                    r^(p)(z), z r^(p)(z), …, z^{l+n−l_p−1} r^(p)(z); r(z), z r(z), …, z^{k−1} r(z) ).

Therefore, there are g ∈ R[z] of degree k ≥ 1 and f ∈ R^{1×p}[z], such that

  g(z) r(z) = f(z) R(z).

Let λ be a root of g(z). Then f(λ) R(λ) = 0, but by the controllability assumption, rank( R(λ) ) = p for all λ ∈ C and, consequently, f(λ) = 0. Therefore, with

  g(z) = (z − λ) g'(z)  and  f(z) = (z − λ) f'(z),

we obtain

  g'(z) r(z) = f'(z) R(z).

Proceeding with this degree lowering procedure yields r(z) = f̄(z) R(z) for some f̄ ∈ R^{1×p}[z], which contradicts the assumption that r is an additional annihilator of H_l(w̃). Therefore, H_l(w̃) has the correct left kernel, and therefore N_l(B) = ker( H_l^⊤(w̃) ).
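A small numerical illustration of the statement (an illustrative MATLAB sketch with made-up system matrices, not part of the software of Appendix B): for an exact trajectory w = col(u, y) of a controllable SISO system of order n, with a generic (persistently exciting) input, the rank of H_l(w) is l + n, so the left kernel of H_l(w) has the dimension p·l − n = l − n predicted by the lemma.

  n = 3; l = 5; T = 100;
  A = diag([0.9, 0.5, -0.4]); B = [1; 1; 1]; C = [1 2 3]; D = 1;   % example system, m = p = 1 (made up)
  u = randn(1, T); x = zeros(n, 1); y = zeros(1, T);
  for t = 1:T, y(t) = C*x + D*u(t); x = A*x + B*u(t); end          % simulate an exact trajectory
  w = [u; y];                                                      % w = col(u, y), a 2 x T time series
  H = zeros(2*l, T-l+1);                                           % block-Hankel matrix H_l(w)
  for i = 1:l, H(2*i-1:2*i, :) = w(:, i:T-l+i); end
  disp([rank(H), l + n, 2*l - rank(H)])                            % expected: 8, 8, and 2 = p*l - n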
A.4 Recursive Errors-in-Variables Smoothing
By the dynamic programming principle (10.9),

  V_t( x̂(t) ) = min_{û(t)} { [û(t); 1]^⊤ [ V_u^{-1},  −V_u^{-1} u_d(t); ⋆,  u_d^⊤(t) V_u^{-1} u_d(t) ] [û(t); 1]
    + [x̂(t); 1]^⊤ [ C^⊤ V_y^{-1} C,  −C^⊤ V_y^{-1} y_d(t); ⋆,  y_d^⊤(t) V_y^{-1} y_d(t) ] [x̂(t); 1]
    + V_{t+1}( A x̂(t) + B û(t) ) },   (A.5)
where the ⋆'s indicate the symmetric blocks in the matrices. Using induction, we prove that the value function V_t is quadratic for all t. At the final moment of time T, V_T ≡ 0 and thus it is trivially quadratic. Assume that V_{t+1} is quadratic for a t ∈ {0, 1, …, T}. Then there are P_{t+1} ∈ R^{n×n}, s_{t+1} ∈ R^{n×1}, and v_{t+1} ∈ R^{1×1}, such that

  V_{t+1}( x̂(t) ) = [x̂(t); 1]^⊤ [ P_{t+1},  s_{t+1}; s_{t+1}^⊤,  v_{t+1} ] [x̂(t); 1],   for all x̂(t).   (A.6)
From (A.5) and (A.6), we have

  V_t( x̂(t) ) = min_{û(t)} { [û(t); 1]^⊤ [ V_u^{-1},  −V_u^{-1} u_d(t); ⋆,  u_d^⊤(t) V_u^{-1} u_d(t) ] [û(t); 1]
    + [x̂(t); 1]^⊤ [ C^⊤ V_y^{-1} C,  −C^⊤ V_y^{-1} y_d(t); ⋆,  y_d^⊤(t) V_y^{-1} y_d(t) ] [x̂(t); 1]
    + [A x̂(t) + B û(t); 1]^⊤ [ P_{t+1},  s_{t+1}; s_{t+1}^⊤,  v_{t+1} ] [A x̂(t) + B û(t); 1] }.   (A.7)
The function to be minimized in (A.7) is a convex quadratic function of û(t),

  [û(t); 1]^⊤ [ B^⊤ P_{t+1} B + V_u^{-1},  B^⊤ P_{t+1} A x̂(t) + B^⊤ s_{t+1} − V_u^{-1} u_d(t);
                ⋆,  [x̂(t); 1]^⊤ M(t) [x̂(t); 1] ] [û(t); 1],

where

  M(t) := [ A^⊤ P_{t+1} A + C^⊤ V_y^{-1} C,  A^⊤ s_{t+1} − C^⊤ V_y^{-1} y_d(t);
            ⋆,  v_{t+1} + y_d^⊤(t) V_y^{-1} y_d(t) + u_d^⊤(t) V_u^{-1} u_d(t) ],
so the minimizing û(t) is

  û(t) = −( B^⊤ P_{t+1} B + V_u^{-1} )^{-1} ( B^⊤ P_{t+1} A x̂(t) + B^⊤ s_{t+1} − V_u^{-1} u_d(t) ).   (A.8)
Substituting (A.8) back into (A.5), we have

  V_t( x̂(t) ) = [û(t); 1]^⊤ [ B^⊤ P_{t+1} B + V_u^{-1},  B^⊤ P_{t+1} A x̂(t) + B^⊤ s_{t+1} − V_u^{-1} u_d(t);
                              ⋆,  u_d^⊤(t) V_u^{-1} u_d(t) ] [û(t); 1]
    + [x̂(t); 1]^⊤ [ A^⊤ P_{t+1} A + C^⊤ V_y^{-1} C,  A^⊤ s_{t+1} − C^⊤ V_y^{-1} y_d(t);
                    ⋆,  v_{t+1} + y_d^⊤(t) V_y^{-1} y_d(t) ] [x̂(t); 1],

which is a quadratic function of x̂(t),

  V_t( x̂(t) ) = [x̂(t); 1]^⊤ [ P_t,  s_t; s_t^⊤,  v_t ] [x̂(t); 1],   for all x̂(t),
with P_t and s_t given in (10.11) and (10.12), respectively. By induction, V_t is quadratic for t = 0, 1, …, T.
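The completion-of-squares step above translates directly into a backward recursion for P_t and s_t. The following MATLAB fragment is only a sketch of that recursion, written down from (A.7) and (A.8) with made-up data and simplified time indexing; the formulas (10.11) and (10.12) in Chapter 10 are the authoritative statement.

  n = 2; T = 50;
  A = [0.9 0.1; 0 0.8]; B = [1; 0.5]; C = [1 0];   % example system matrices (assumed, m = p = 1)
  Vu = 0.1; Vy = 0.2;                              % input and output noise variances (assumed)
  ud = randn(T, 1); yd = randn(T, 1);              % measured input/output data (made up)
  P = zeros(n); s = zeros(n, 1);                   % V_T = 0
  for t = T:-1:1
    Q = B'*P*B + inv(Vu);                          % coefficient of u(t) in the quadratic form of (A.7)
    F = B'*P*A; g = B'*s - Vu\ud(t);               % affine part of the minimizer (A.8)
    Pn = A'*P*A + C'*(Vy\C) - F'*(Q\F);            % quadratic part of V_t
    sn = A'*s - C'*(Vy\yd(t)) - F'*(Q\g);          % linear part of V_t
    P = Pn; s = sn;
  end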
Appendix B
Software
This appendix describes a software implementation of the algorithms in the book. Except for the STLS solver, presented in Section B.2, all functions are written in MATLAB code. For maximum efficiency, the STLS solver is written in C with calls to BLAS, LAPACK, and SLICOT. The C function, however, is also callable from MATLAB via a mex-file interface. The software and related information are available from the following address:
http://www.esat.kuleuven.be/imarkovs/book.html
B.1 Weighted Total Least Squares
Introduction
The weighted total least squares toolbox, presented in this section, contains MATLAB functions (m-files) for data approximation by linear static models. The data is a collection of N, d-dimensional real vectors d_1, …, d_N ∈ R^d, gathered in a matrix D := [d_1 ⋯ d_N] ∈ R^{d×N}, and a linear static model B for D is a subspace of R^d. The natural number m := dim(B) is a measure of the model complexity and L^d_{m,0} denotes the set of all linear static models with d variables of dimension at most m.
A linear static model B ∈ L^d_{m,0} can be represented as a kernel or image of a matrix or in an input/output form; see Section 3.2. A representation of the model yields a parameterization. The model is described by equations that depend on a parameter, and to a given parameter corresponds a unique model. For a given model and a chosen representation, however, the corresponding parameter might not be unique. The parameters R and P in a kernel and image representation are in general not unique, but the parameter X in the special input/output representation B_i/o(X) is unique.
We use the shorthand notation [d_1 ⋯ d_N] ∈ B ⊆ U for d_i ∈ B, i = 1, …, N.
If D ∈ B, the model B fits the data D exactly. If D ∉ B, the model B fits the data D only approximately. For optimal approximate modeling, the following misfit function is adopted:
  M_wtls( [d_1 ⋯ d_N], B ) := min_{d̂_1, …, d̂_N ∈ B} √( Σ_{i=1}^N (d_i − d̂_i)^⊤ W_i (d_i − d̂_i) ),

where W_1, …, W_N are given positive definite matrices. The weighted total least squares (WTLS) misfit M_wtls(D, B) between the data D and a model B ∈ L^d_{m,0} is a measure of how much the model fails to fit the data exactly. The considered approximate modeling problem is as follows:
Given the data matrix D = [d_1 ⋯ d_N] ∈ R^{d×N}, a complexity bound m, and positive definite weight matrices W_1, …, W_N, find an approximate model

  B̂_wtls := arg min_{B̂ ∈ L^d_{m,0}} M_wtls(D, B̂).   (WTLS)

The special cases listed in Table B.1 allow for special solution methods and are treated separately.

Table B.1. Special cases of the weighted total least squares problem (WTLS).
  Special case                                Name                              Acronym
  W_i = σ² I, σ² ∈ R_+                        total least squares               TLS
  W_i = diag(w), w ∈ R^d_+                    element-wise generalized TLS      EWGTLS
  W_i = W, W > 0                              generalized total least squares   GTLS
  M_wtls = M_gtls2, W_l, W_r > 0 diagonal     EWGTLS with two side weighting    EWGTLS2
  M_wtls = M_gtls2, W_l, W_r > 0              GTLS with two side weighting      GTLS2
  W_i = diag(w_i), w_i ∈ R^d_+                element-wise weighted TLS         EWTLS
Note B.1 (FWTLS) The following weighted total least squares problem, called the fully weighted total least squares (FWTLS) problem,

  B̂_fwtls := arg min_{B̂ ∈ L^d_{m,0}} min_{D̂ ∈ B̂} vec^⊤(D − D̂) W vec(D − D̂),   where W ∈ R^{dN×dN}, W > 0,

is also considered. It includes (WTLS) as a special case with W = diag(W_1, …, W_N). The FWTLS problem, however, does not allow for efficient computational methods and its solution is prohibitive already for small sample size problems (say d = 10 and N = 100). For this reason the FWTLS problem is not the central problem of interest and is included only for completeness.
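To make the misfit definition concrete, the following MATLAB fragment evaluates M_wtls for a model given in the special input/output representation B_i/o(X), using the closed-form expression derived in Appendix A.1. The data, the weights, and the partition of d_i into an input part a_i and an output part b_i are made up for the illustration; in practice the toolbox misfit functions listed below should be used.

  d = 4; m = 2; N = 20;
  D = randn(d, N);                                       % data matrix [d_1 ... d_N] (made up)
  for i = 1:N, R = randn(d); W{i} = R*R' + d*eye(d); end % weights W_i > 0 (made up)
  X = randn(m, d-m);                                     % parameter of B_i/o(X) (made up)
  M2 = 0;
  for i = 1:N
    a = D(1:m, i); b = D(m+1:end, i);                    % input/output partition of d_i
    e = X'*a - b;                                        % equation error e_i(X)
    G = [X; -eye(d-m)]' * (W{i} \ [X; -eye(d-m)]);       % Gamma_i(X), as in Appendix A.1
    M2 = M2 + e' * (G \ e);                              % squared misfit contribution of d_i
  end
  M = sqrt(M2)                                           % the WTLS misfit M_wtls(D, B_i/o(X))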
Algorithms
The special cases listed in Table B.1 have increased generality from top to bottom. The
more general the problem is, however, the more computationally expensive its solution is.
The TLS and GTLS problems allow for analytic solutions in terms of the SVD. The more
general EWTLS, WTLS, and FWTLS problems have no similar analytic solutions and use
less robust iterative solution methods.
The SVD method is computationally faster than the alternative iterative optimization
methods and theoretically characterizes all globally optimal solutions. In contrast, the
iterative optimization methods (used in the package) compute one locally optimal solution.
(The algorithm of Premoli and Rastello [PR02, MRP+05] is not globally convergent to a
local solution, so that for particular initial approximations this method might not converge
to a local solution. In such cases the algorithm diverges or oscillates.)
The GTLS-type problems (EWGTLS, GTLS, EWGTLS2, and GTLS2) are solved
in the package via the transformation technique of Theorem 3.18. The data matrix is
appropriately scaled and the corresponding TLS problem is solved for the scaled data. Then
the solution of the original problem is recovered from the solution of the transformed TLS
problem via the inverse transformation.
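The idea of the transformation is sketched below for the GTLS case W_i = W > 0 (an illustration written for this appendix, not the package's gtls implementation; Theorem 3.18 is the authoritative statement). The data are scaled by a square root of W, a TLS model for the scaled data is obtained from the dominant left singular vectors, and the model is mapped back by the inverse scaling.

  d = 4; m = 2; N = 50;
  D = randn(d, N);                                 % data matrix (made up)
  R = randn(d); W = R*R' + d*eye(d);               % common weight matrix W > 0 (made up)
  Wh = sqrtm(W);                                   % a symmetric square root of W, W = Wh*Wh'
  Dt = Wh * D;                                     % scaled data
  [U, ~, ~] = svd(Dt);
  Pt = U(:, 1:m);                                  % TLS model for the scaled data: image(Pt)
  P  = Wh \ Pt;                                    % inverse transformation: GTLS model is image(P)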
The general WTLS problem is solved via local optimization methods. The following
algorithms are used/implemented in the package:
1. classical local optimization methods (from the Optimization Toolbox of MATLAB),
2. an alternating least squares algorithm,
3. the algorithm of Premoli and Rastello.
Implementation
The implementation is in MATLAB code. For problems with analytic solution, the MATLAB code is expected to compute a solution nearly as fast as an alternative code in C or FORTRAN. The general WTLS algorithms, however, are expected to benefit in terms of execution time if implemented in C or FORTRAN. The MATLAB source code could be
consulted for the implementation details.
Overview of Commands
The package has three main groups of functions: transformations, misfit computations, and approximations.
The transformation functions convert a given representation of a model to an equivalent
one. The considered representations are image, kernel, and input/output, so that there are
in total six transformation functions among them (see Figure 3.2). In addition, a kernel
or an image representation might not be minimal, so that functions that convert a given
kernel or image representation to a minimal one are added. The transformation functions
are summarized in Table B.2.
Table B.2. Transformation functions.
  Function   Description
  x2r    X → R        from input/output to kernel representation
  x2p    X → P        from input/output to image representation
  r2p    R → P        from kernel to image representation
  p2r    P → R        from image to kernel representation
  r2x    R → X        from kernel to input/output representation
  p2x    P → X        from image to input/output representation
  minr   R → R_min    minimal kernel representation
  minp   P → P_min    minimal image representation

The misfit computation functions are used for validation: they allow the user to verify how well a given model fits given data in terms of a certain misfit function. Since the model can be specified by one of the three alternative representations (kernel, image, or input/output) all misfit functions have three versions. The following naming convention is adopted: misfit computation functions begin with m (for misfit), followed by the name of the approximation problem (which identifies the type of misfit to be computed), followed by a
letter indicating the model representation: r for kernel, p for image, and x for input/output.
Instead of a model B, an approximating matrix D̂ ∈ R^{d×N} can be used for the misfit computation. In this case the last letter of the function name is dh.
The considered misfit functions are TLS, GTLS, GTLS2, WTLS, and FWTLS. The element-wise versions of the GTLS, GTLS2, and WTLS misfits are specified by the size of the given weight matrices: if vectors are given in mgtls{r,p,x,dh} and mgtls2{r,p,x,dh} instead of square weight matrices, then the EWGTLS and EWGTLS2 misfits are computed instead of the GTLS and GTLS2 ones. Similarly, if a d × N matrix is given instead of a d × d × N tensor in mwtls{r,p,x,dh}, then the EWTLS misfit is computed instead of the WTLS one. The general FWTLS misfit is computed by the functions mwtls{r,p,x,dh} if the weight matrix is of size dN × dN. The misfit computation functions are summarized in Table B.3.

Table B.3. Misfit computation functions.
  Function                                 Description
  mtlsr    mtlsp    mtlsx    mtlsdh        TLS misfit
  mgtlsr   mgtlsp   mgtlsx   mgtlsdh       GTLS misfit
  mgtls2r  mgtls2p  mgtls2x  mgtls2dh      GTLS2 misfit
  mwtlsr   mwtlsp   mwtlsx   mwtlsdh       WTLS misfit
The approximation functions compute a WTLS approximation of the data. The special WTLS problems are called by special functions that are more efficient; see Table B.4. As in the misfit computation, the element-wise versions of the functions are recognized by the dimension of the weight matrices. The function wtls uses the quasi-Newton optimization algorithm that seems to outperform the alternatives. The alternative methods can be called by the corresponding functions; see Table B.5.

Table B.4. Approximation functions.
  Function   Description
  tls        TLS approximation
  gtls       GTLS approximation
  gtls2      GTLS2 approximation
  wtls       WTLS approximation
B.2 Structured Total Least Squares
The package uses MINPACK's Levenberg–Marquardt algorithm [Mar63] for the solution of the STLS problem (STLS_X) with the structure specification of Assumption 4.4 in its
equivalent formulation (4.4). There is no closed form expression for the Jacobian matrix J = [∂r_i/∂x_j], where x = vec(X), so that the pseudo-Jacobian J_+ proposed in [GP96] is used instead of J. Its evaluation is done with computational complexity O(m).
The software is written in ANSI C language. For the vector-matrix manipulations and for a C version of MINPACK's Levenberg–Marquardt algorithm, we use the GNU Scientific Library (GSL). The computationally most intensive step of the algorithm, the Cholesky decomposition of the block-Toeplitz, block-banded weight matrix Γ(X), is performed via the subroutine MB02GD from the SLICOT library [VSV+04]. By default, the optimization algorithm is initialized with the TLS solution. Its computation is performed via the SLICOT subroutine MB02MD.
The package contains
C-source code: stls.c and stls.h (the function stls implements Algorithm 4.3);
MATLAB interface to the C function stls via the C-mex file stls.m;
a demo file demo.m with examples that illustrate the application of the STLS solver;
user guide and papers that describe the STLS problem in more detail.
C Function
The function stls implements the method outlined in Section 4.5 to solve the STLS
problem (STLS_X). Its prototype is
int stls(gsl_matrix* a, gsl_matrix* b, const data_struct* s,
gsl_matrix* x, gsl_matrix* v, opt_and_info* opt)
Table B.5. Auxiliary functions.
Function Description
wtlsini initial approximation for the WTLS approximation functions
wtlsap WTLS approximation by alternating projections
wtlsopt WTLS approximation by classical optimization methods
qncostderiv cost function and gradient for the quasi-Newton methods
lmcostderiv   cost function and Jacobian for the Levenberg–Marquardt method
wtlspr        WTLS approximation by the algorithm of [MRP+05]
Description of the arguments:
a and b are the matrices A ∈ R^{m×n} and B ∈ R^{m×d}, respectively, such that [A B] = S(p). We refer to the GSL reference manual for the definition of gsl_matrix and the functions needed to allocate and initialize variables of this type.
s is the structure description K, S of S(p). The type data_struct is defined in stls.h as

/* structure of the data matrix C = [A B] */
#define MAXQ 10        /* maximum number of blocks in C */
typedef struct {
  int K;               /* = rowdim(block in T/H blocks) */
  int q;               /* number of blocks in C = [C1 ... Cq] */
  struct {
    char type;         /* T-Toeplitz, H-Hankel, U-unstructured, E-exact */
    int ncol;          /* number of columns */
    int nb;            /* = coldim(block in T/H blocks) */
  } a[MAXQ];           /* q-element array describing C1, ..., Cq */
} data_struct;
x on input contains the initial approximation for the Levenberg–Marquardt algorithm and on exit, upon convergence of the algorithm, contains a local minimum point of the cost function f_0.
v on exit contains the error covariance matrix (J_+^⊤ J_+)^{-1} of the vectorized estimate x̂ = vec(X̂). It can be used for deriving confidence bounds.
opt on input contains options that control the exit condition of the Levenberg–Marquardt algorithm and on exit contains information about the convergence of the algorithm. The exit condition is

  |x_j^{(k+1)} − x_j^{(k)}| < epsabs + epsrel · |x_j^{(k+1)}|,   for all j = 1, …, nd,   (B.1)

where x^{(k)}, k = 1, 2, …, iter ≤ maxiter, are the successive iterates, and epsrel, epsabs, maxiter are fields of opt. Convergence to the desired tolerance is indicated by a positive value of opt.iter. In this case, opt.iter is the number of iterations performed. opt.iter = -1 indicates lack of convergence. opt.time and opt.fmin show the time in seconds used by the algorithm and the cost function f_0 value at the computed solution.
The type opt_and_info is defined in stls.h as

/* optimization options and output information structure */
typedef struct {
  /* input options */
  int maxiter;
  double epsrel, epsabs;
  /* output information */
  int iter;
  double fmin;
  double time;
} opt_and_info;
MATLAB Mex-File
The provided C-mex file allows us to call the C solver stls via the MATLAB command
>> [xh, info, v] = stls(a, b, s, x, opt);
The input arguments a, b, and s are obligatory. x and opt are optional and can be
skipped by the empty matrix []. In these cases their default values are used.
Description of the arguments:
a and b are the matrices A ∈ R^{m×n} and B ∈ R^{m×d}, respectively, where [A B] = S(p).
s is a q × 3 matrix or a structure with a scalar field k and a q × 3 matrix field a. In the first case, K is assumed to be 1, and in the second case it is specified by s.k. The array S, introduced in Section 4.6, is specified by s in the first case and by s.a in the second case. The first column of s (or s.a) defines the type of the blocks C^(1), …, C^(q) (1 block-Toeplitz, 2 block-Hankel, 3 unstructured, 4 exact), the second column defines n_1, …, n_q, and the third column defines t_1, …, t_q.
x is a user-supplied initial approximation. Its default value is the TLS solution.
opt contains user-supplied options for the exit conditions. opt.maxiter defines the maximum number of iterations (default 100), opt.epsrel defines the relative tolerance epsrel (default 1e-5), and opt.epsabs defines the absolute tolerance epsabs (default 1e-5); see (B.1).
xh is the computed solution.
info is a structure with fields iter, time, and fmin that gives information for the termination of the optimization algorithm. These fields are the ones returned from the C function.
v is the error covariance matrix (J_+^⊤ J_+)^{-1} of the vectorized estimate x̂ = vec(X̂).
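As an illustration, the following call solves a made-up problem in which the block A is Toeplitz structured and B is unstructured. The block sizes, the generating parameter, and the third column of s are placeholders chosen only for this example; see Section 4.6 and the demo file demo.m for the conventions actually used by the package.

  n = 3; d = 1; m = 20;
  p0 = randn(m+n-1, 1);                        % parameter vector generating the Toeplitz block (made up)
  A  = toeplitz(p0(n:end), p0(n:-1:1));        % m x n Toeplitz block
  B  = randn(m, d);                            % m x d unstructured block (made up)
  s  = [1 n 1;                                 % block C1: type 1 (Toeplitz), n columns
        3 d 1];                                % block C2: type 3 (unstructured), d columns
  [xh, info, v] = stls(A, B, s, [], []);       % x and opt skipped by [], so defaults are used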
Compilation
The included make file, when called with argument mex, generates the MATLAB mex file. The GSL, BLAS, and LAPACK libraries have to be installed in advance. For their location and for the location of the mex command and options file, one has to edit the provided make file. Precompiled mex-files are included for Linux only.
Table B.6. Elementary building blocks for the exact identification algorithms.
  Function   Description
  w2r        from time series to a kernel representation
  r2pq       from kernel representation to a left matrix fraction representation
  pq2ss      from left matrix fraction representation to an input/state/output representation
  uy2h       computation of the impulse response
  uy2hblk    block computation of the impulse response
  h2ss       Kung's realization algorithm
  uy2y0      computation of sequential free responses
  uy2hy0     computation of the impulse response and sequential free responses
  y02o       from a set of free responses to an observability matrix
  y02x       from a set of sequential free responses to a state sequence
  uyo2ss     from data and observability matrix to an input/state/output representation
  uyx2ss     from data and a state sequence to an input/state/output representation
  hy02xbal   from the impulse response and sequential free responses to a balanced state sequence
B.3 Balanced Model Identification
This section describes a MATLAB implementation of the algorithms for exact identification, presented in Chapters 8 and 9. Although the algorithms were originally designed to work with exact data, they can also be used as heuristic methods for approximate identification; see Note 8.18. By specifying the parameters n_max and l_max lower than the actual order and lag of the MPUM, the user obtains an approximate model in the model class L^{w,n_max}_{m,l_max}. Another approach for deriving an approximate model via the algorithms described in this section is to do balanced model reduction of the MPUM; see Note 9.2.
The exact identification algorithms are decomposed into elementary building blocks that have independent significance. Table B.6 lists the building blocks together with short descriptions. More details can be found in the documentation of the corresponding m-files.
Table B.7 shows the implementation of the algorithms in Chapters 8 and 9 in terms of the elementary building blocks. Exceptions are Algorithms 9.5 and 9.6, which are included for completeness and are implemented as described in the original sources.
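To give a flavor of what an individual building block computes, the following MATLAB fragment is a generic sketch of Kung's realization algorithm, i.e., of the step performed by h2ss. It is not the package's implementation, and the system, the sizes, and the variable names are made up for the illustration; the calling conventions of the m-files are documented in their help texts.

  n = 2; L = 10;                                               % true order and Hankel depth (assumed)
  A0 = [0.8 0.2; -0.1 0.7]; B0 = [1; 0]; C0 = [1 1]; D0 = 0;   % a "true" SISO system (made up)
  h = zeros(2*L+1, 1); h(1) = D0;
  for k = 1:2*L, h(k+1) = C0*A0^(k-1)*B0; end                  % impulse response h(0), ..., h(2L)
  H = hankel(h(2:L+1), h(L+1:2*L+1));                          % Hankel matrix of the Markov parameters
  [U, S, V] = svd(H);
  O  = U(:, 1:n)*sqrt(S(1:n, 1:n));                            % extended observability matrix (balanced)
  Ct = sqrt(S(1:n, 1:n))*V(:, 1:n)';                           % extended controllability matrix (balanced)
  Ah = O(1:end-1, :) \ O(2:end, :);                            % shift equation for A
  Bh = Ct(:, 1); Ch = O(1, :); Dh = h(1);                      % read off B, C, and D
  disp([h(3), Ch*Ah*Bh])                                       % sanity check: second Markov parameter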
B.4 Approximate Identification
We describe MATLAB functions (m-files) for approximate LTI system identification. A discrete-time dynamical system B ⊆ (R^w)^Z is a collection of trajectories (w-variable time series w : Z → R^w). No a priori distinction of the variables into inputs and outputs is made and the system is not a priori bound to a particular representation. The variables w can be partitioned into inputs u (free variables) and outputs y (dependent variables) and the system can be represented in various equivalent forms, e.g., the ubiquitous input/state/output representation

  σx = Ax + Bu,   y = Cx + Du.   (I/S/O)
Table B.7. Implementation of the algorithms in Chapters 8 and 9.
  Algorithm 8.1   w2r
  Algorithm 8.2   w2r → r2pq → pq2ss
  Algorithm 8.3   uy2h → h2ss
  Algorithm 8.4   uy2y0 → y02o → uyo2ss
  Algorithm 8.5   uy2y0 → y02x → uyx2ss
  Algorithm 8.6   uy2h_blk
  Algorithm 8.7   uy2h
  Algorithm 8.8   h2ss
  Algorithm 8.9   uy2y0
  Algorithm 9.1   uy2hy0 → hy02xbal → x2ss
  Algorithm 9.2   uy2h → h2ss (= Algorithm 8.3)
  Algorithm 9.3   uy2h → h2o → uyo2ss
  Algorithm 9.4   uy2hy0 → hy02xbal → x2ss (= Algorithm 9.1)
  Algorithm 9.5   uy2ssvd
  Algorithm 9.6   uy2ssmr
The number of inputs m, the number of outputs p, and the minimal state dimension n of an input/state/output representation are invariant of the representation and in particular of the input/output partitioning.
The class of finite dimensional LTI systems with w variables and at most m inputs is denoted by L^w_m. The number of inputs and the minimal state dimension specify the complexity of the system in the sense that the restriction of B to the interval [1, T], where T ≥ n, is a (Tm + n)-dimensional subspace. Equivalently, the complexity of the system can be specified by the input dimension and the lag of the system. The lag of B is the minimal natural number l, for which there exists an lth order difference equation

  R_0 w(t) + R_1 w(t + 1) + ⋯ + R_l w(t + l) = 0   (DE)

representation of the system, i.e., B = { w | (DE) holds }. The subset of L^w_m with lag at most l is denoted by L^w_{m,l}.
The considered identification problem is the global total least squares problem [RH95, MWV+05]:

Given a time series w_d ∈ (R^w)^T and a complexity specification (m, l), find the system

  B̂ := arg min_{B ∈ L^w_{m,l}} M(w_d, B),   where   M(w_d, B) := min_{ŵ ∈ B} ‖w_d − ŵ‖_{ℓ2}.   (GlTLS)

The number M(w_d, B) is the misfit between w_d and B. It shows how much the model B fails to explain the data w_d. The optimal approximate modeling problem (GlTLS) aims to find the system B̂ in the model class L^w_{m,l} that best fits the data according to the misfit criterion.
The software presented in Section B.2 for STLS problems is the core computational tool for solving the system identification problem. In fact, the software presented in this section can be viewed as an interface to the STLS solver for the purpose of LTI system identification.
The STLS solver gives as a result a difference equation representation of the optimal approximating system B̂. The function stlsident, described next, converts the parameter X̂ to the parameters (Â, B̂, Ĉ, D̂) of an input/state/output representation of B̂. The MATLAB code of the functions in the package can be consulted for the implementation details.
Usage
The function stlsident solves the approximate identification problem (GlTLS), and the function misfit computes the misfit M(w_d, B).
Both functions use the input/state/output representation (I/S/O) of the systems that are returned as an output and accepted as an input, so that they can be viewed as implementations of the following mappings:

  stlsident: (w_d, m, l) ↦ (Â, B̂, Ĉ, D̂);   and
  misfit: (w_d, (A, B, C, D)) ↦ (M, ŵ_d).
The following are a special case and extensions:
the specification m = 0 corresponds to an output-only system identification (B̂ autonomous);
the functions work with multiple given time series w^k = (w^k(1), …, w^k(T)), k = 1, …, N; and
some elements of w can be specified as exact, in which case they appear unmodified in the approximation ŵ.
Using a combination of these options, one can solve approximately the realization problem, the finite time ℓ_2 model reduction problem (see [MWV+05, Section 5]), and the output error identification problem. Examples are given in Sections 11.4 and 11.5.
Calling sequences
[ sysh, info, wh, xini ] = stlsident( w, m, l, opt );
Inputs:
w, the given time series w_d; a real MATLAB array of dimension T × w × N, where T is the number of samples, w is the number of variables, and N is the number of time series;
m, the input dimension for the identified system;
l, the lag of the identified system;
opt, options for the optimization algorithm:
  opt.exct (default []), a vector of indices for exact variables;
  opt.sys0 (default the total least squares approximation), an initial approximation: an input/state/output representation of a system, given as the MATLAB object ss (see help ss), with m inputs, w-m outputs, and order l*(w-m);
  opt.disp (default notify), level of displayed information about the optimization process; the options are off (silent), notify (only if not converged), final (convergence status), and iter (per iteration);
  opt.maxiter (default 100), a maximum number of iterations;
  opt.epsrel, opt.epsabs, and opt.epsgrad (default 10^{-5}), convergence tolerances; the convergence condition is

    |X_ij^{(k+1)} − X_ij^{(k)}| < opt.epsabs + opt.epsrel · |X_ij^{(k+1)}|,   for all i, j,
    or ‖M'(X^{(k+1)})‖ < opt.epsgrad,

  where X^{(k)}, k = 1, 2, …, info.iter ≤ opt.maxiter, are the successive iterates of the parameter X and M'(X^{(k+1)}) is the gradient of the cost function at the current iteration step.
Outputs:
sysh, an input/state/output representation of the identified system B̂;
info, information from the optimization solver:
  info.M, the misfit M(w_d, B̂);
  info.time, the execution time for the STLS solver; not equal to the execution time of stlsident;
  info.iter, the number of iterations; note that info.iter = opt.maxiter indicates lack of convergence to a desired convergence tolerance;
wh, the optimal approximating time series;
xini, a matrix whose columns are the initial conditions under which ŵ^k, k = 1, …, N, are obtained.
[ M, wh, xini ] = misfit( w, sys, exct );
Inputs:
w, the given time series w_d; a real MATLAB array of dimensions T × w × N, where T is the number of samples, w is the number of variables, and N is the number of time series;
sys, an input/state/output representation of a system B, given as the MATLAB object ss (see help ss), with w external variables (inputs and outputs) and of order which is a multiple of the number of outputs;
exct (default []), a vector of indices for exact variables.
Outputs:
M, the misfit M(w_d, B);
wh, the optimal approximating time series ŵ;
xini, a matrix whose columns are the initial conditions under which ŵ^k, k = 1, …, N, are obtained.
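A minimal session with the two functions looks as follows (the data are random and serve only to show the calling sequence; with data generated by an actual system the identified model and the misfit are meaningful).

  T = 100; w = 2; m = 1; l = 2;                 % scalar input, scalar output, lag 2
  wd = randn(T, w);                             % one given time series (N = 1), a T x w array (made up)
  [sysh, info, wh] = stlsident(wd, m, l);       % identify a system in L^w_{m,l}
  info.M                                        % misfit of the identified model
  [M, wh2] = misfit(wd, sysh);                  % recompute the misfit for validation (exct defaults to [])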
The functions stlsidentuy and misfituy are versions of stlsident and misfit that use an a priori given input/output partitioning of the variables. For details on their usage, see their MATLAB help.
Notation
Sets of numbers                                                           page
  R, R_+          the set of real numbers, nonnegative real numbers       2, 37
  Z, N            the set of integers, and natural numbers {0, 1, 2, ...} 102, 35
Norms and extreme eigenvalues                                             page
  ‖x‖, x ∈ R^n            2-norm of a vector, √(Σ_{i=1}^n x_i²)           3
  ‖A‖, A ∈ R^{m×n}        induced 2-norm, max_{‖x‖=1} ‖Ax‖                36
  ‖A‖_F, A ∈ R^{m×n}      Frobenius norm, √(trace(A A^⊤))                 1
  ‖w‖, w ∈ (R^w)^T        2-norm of a time series, √(Σ_{t=1}^T ‖w(t)‖²)   24
  ‖w‖, w ∈ (R^{w×N})^T    2-norm of a matrix valued time series, √(Σ_{t=1}^T ‖w(t)‖_F²)   168
  λ_min(A), λ_max(A)      minimum, maximum eigenvalue of a symmetric matrix   75
Matrix operations                                                         page
  A^+                     pseudoinverse                                   53
  A^⊤                     transpose of a matrix                           18
  vec(A)                  column-wise vectorization of a matrix           52
  col(a, b)               the column vector [a; b]                        2
  col dim(A)              the number of columns of A                      21
  row dim(A)              the number of block rows of A                   126
  col span(A)             the span of the columns of A (the image or range of A)   9
  diag(v), v ∈ R^n        the diagonal matrix diag(v_1, ..., v_n)         30
  diag(V_1, ..., V_n)     (block-) diagonal matrix with diagonal blocks V_1, ..., V_n   41
  ⊗                       Kronecker product, A ⊗ B := [a_ij B]            41
  ⊙                       element-wise (Hadamard) product, A ⊙ B := [a_ij b_ij]   30
  δ                       Kronecker delta, δ_0 = 1 and δ_t = 0 for all t ≠ 0   58
Expectation, covariance, and normal distribution                          page
  E, cov                  expectation, covariance operator                59
  x ~ N(m, V)             x is normally distributed with mean m and covariance V   31
Fixed symbols                                                             page
  U                       universum of outcomes from an experiment        16
  B                       model behavior                                  16
  M                       model class                                     16
  H_l(w)                  Hankel matrix with l block rows; see (H)        120
  S                       structure specification for the STLS problem    50
  X_ext := [X; −I]        extended parameter in an input/output parameterization   52
LTI model class and invariants                                            page
  m(B)                    number of inputs of B                           105
  p(B)                    number of outputs of B                          105
  l(B)                    lag of B                                        105
  n(B)                    order of B                                      105
  L^{w,n}_{m,l} := { B ⊆ (R^w)^Z | B is LTI, m(B) ≤ m ≤ w, l(B) ≤ l, n(B) ≤ n }   113

If m, l, or n is not specified, the corresponding invariant m(B), l(B), or n(B) is not bounded.
Miscellaneous                                                             page
  :=                      left-hand side is defined by the right-hand side   19
  =:                      right-hand side is defined by the left-hand side   20
  σ                       the backwards shift operator, (σf)(t) = f(t + 1); acting on a vector or matrix, σ removes the first block row   23
  σ*                      the forward shift operator, (σ*f)(t) = f(t − 1); acting on a vector or matrix, σ* removes the last block row    124
Abbreviations
ALS adjusted least squares
DAISY data base for system identification
EIV errors-in-variables
EWTLS element-wise weighted total least squares
GTLS generalized total least squares
GlTLS global total least squares
LS least squares
LTI linear time-invariant
MIMO multi-input multi-output
MPUM most powerful unfalsied model
SISO single-input single-output
STLS structured total least squares
SVD singular value decomposition
TLS total least squares
WLS weighted least squares
WTLS weighted total least squares
Bibliography
[AMH91] T. Abatzoglou, J. Mendel, and G. Harada. The constrained total least squares technique and its application to harmonic superresolution. IEEE Trans. Signal Process., 39:1070-1087, 1991.
[AY70a] M. Aoki and P. Yue. On a priori error estimates of some identification methods. IEEE Trans. Automat. Control, 15(5):541-548, 1970.
[AY70b] M. Aoki and P. Yue. On certain convergence questions in system identification. SIAM J. Control, 8(2):239-256, 1970.
[BHN99] R. Byrd, M. Hribar, and J. Nocedal. An interior point algorithm for large-scale nonlinear programming. SIAM J. Optim., 9(4):877-900, 1999.
[BM86] Y. Bresler and A. Macovski. Exact maximum likelihood parameter estimation of superimposed exponential signals in noise. IEEE Trans. Acoust., Speech, Signal Process., 34:1081-1089, 1986.
[Boo79] F. L. Bookstein. Fitting conic sections to scattered data. Computer Graphics and Image Processing, 9:59-71, 1979.
[Bro70] R. Brockett. Finite Dimensional Linear Systems. John Wiley, New York, 1970.
[Cad88] J. Cadzow. Signal enhancement - a composite property mapping algorithm. IEEE Trans. Signal Process., 36:49-62, 1988.
[CRS95] R. Carroll, D. Ruppert, and L. Stefanski. Measurement Error in Nonlinear Models. Chapman & Hall/CRC, London, 1995.
[CST00] C. Cheng, H. Schneeweiss, and M. Thamerus. A small sample estimator for a polynomial regression with errors in the variables. J. R. Stat. Soc. (Ser. Stat. Methodol. B), 62:699-709, 2000.
[DGS03] R. Diversi, R. Guidorzi, and U. Soverini. Kalman filtering in symmetrical noise environments. In Proceedings of the 11th IEEE Mediterranean Conference on Control and Automation, Rhodes, Greece, 2003.
[DM93] B. De Moor. Structured total least squares and L_2 approximation problems. Linear Algebra Appl., 188-189:163-207, 1993.
[DM94] B. De Moor. Total least squares for affinely structured matrices and the noisy realization problem. IEEE Trans. Signal Process., 42(11):3104-3113, 1994.
[DM03] B. De Moor. On the number of rows and columns in subspace identification methods. In Proceedings of the 13th IFAC Symposium on System Identification, pages 1796-1801, Rotterdam, The Netherlands, 2003.
[DM05] B. De Moor. DaISy: Database for the identification of systems. Dept. EE, K.U.Leuven, www.esat.kuleuven.be/sista/daisy/, 2005.
[DR94] B. De Moor and B. Roorda. L_2-optimal linear system identification structured total least squares for SISO systems. In Proceedings of the 33rd Conference on Decision and Control, pages 2874-2879, Lake Buena Vista, FL, 1994.
[EY36] G. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211-218, 1936.
[FPF99] A. Fitzgibbon, M. Pilu, and R. Fisher. Direct least-squares fitting of ellipses. IEEE Trans. Pattern Anal. Machine Intelligence, 21(5):476-480, 1999.
[Ful87] W. Fuller. Measurement Error Models. Wiley, New York, 1987.
[FW97] F. Fagnani and J. C. Willems. Deterministic Kalman filtering in a behavioral framework. Control Lett., 32:301-312, 1997.
[Gal82] P. Gallo. Consistency of regression estimates when some variables are subject to error. Comm. Statist. A - Theory Methods, 11:973-983, 1982.
[GDS03] R. Guidorzi, R. Diversi, and U. Soverini. Optimal errors-in-variables filtering. Automatica, 39:281-289, 2003.
[GGS94] W. Gander, G. Golub, and R. Strebel. Fitting of circles and ellipses: Least squares solution. BIT, 34:558-578, 1994.
[GP96] P. Guillaume and R. Pintelon. A Gauss-Newton-like optimization algorithm for weighted nonlinear least-squares problems. IEEE Trans. Signal Process., 44(9):2222-2228, 1996.
[GV80] G. Golub and C. Van Loan. An analysis of the total least squares problem. SIAM J. Numer. Anal., 17:883-893, 1980.
[Har97] R. Hartley. In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Machine Intelligence, 19(6):580-593, June 1997.
[HS99] C. Heij and W. Scherrer. Consistency of system identification by global total least squares. Automatica, 35:993-1008, 1999.
[Kan94] K. Kanatani. Statistical bias of conic fitting and renormalization. IEEE Trans. Pattern Anal. Machine Intelligence, 16(3):320-326, 1994.
[KM00] A. Kukush and E.-O. Maschke. The efficiency of adjusted least squares in the linear functional relationship. DP 208, SFB 386, Univ. of Munich, 2000.
[KMV02] A. Kukush, I. Markovsky, and S. Van Huffel. Consistent fundamental matrix estimation in a quadratic measurement error model arising in motion analysis. Comput. Statist. Data Anal., 41(1):3-18, 2002.
[KMV03] A. Kukush, I. Markovsky, and S. Van Huffel. Consistent estimation in the bilinear multivariate errors-in-variables model. Metrika, 57(3):253-285, 2003.
[KMV04] A. Kukush, I. Markovsky, and S. Van Huffel. Consistent estimation in an implicit quadratic measurement error model. Comput. Statist. Data Anal., 47(1):123-147, 2004.
[KMV05] A. Kukush, I. Markovsky, and S. Van Huffel. Consistency of the structured total least squares estimator in a multivariate errors-in-variables model. J. Statist. Plann. Inference, 133(2):315-358, 2005.
[Kun78] S. Kung. A new identification method and model reduction algorithm via singular value decomposition. In Proceedings of the 12th Asilomar Conference on Circuits, Systems, and Computers, pages 705-714, Pacific Grove, CA, 1978.
[KV04] A. Kukush and S. Van Huffel. Consistency of elementwise-weighted total least squares estimator in a multivariate errors-in-variables model AX = B. Metrika, 59(1):75-97, 2004.
[KZ02] A. Kukush and S. Zwanzig. On consistent estimators in nonlinear functional EIV models. In Van Huffel and Lemmerling [VL02], pages 145-155.
[LD01] P. Lemmerling and B. De Moor. Misfit versus latency. Automatica, 37:2057-2067, 2001.
[Lev64] M. Levin. Estimation of a system pulse transfer function in the presence of noise. IEEE Trans. Automat. Control, 9:229-235, 1964.
[Lju99] L. Ljung. System Identification: Theory for the User. Prentice-Hall, Upper Saddle River, NJ, 1999.
[LM00] Y. Leedan and P. Meer. Heteroscedastic regression in computer vision: Problems with bilinear constraint. Int. J. Comput. Vision, 37(2):127-150, 2000.
[LMV00] P. Lemmerling, N. Mastronardi, and S. Van Huffel. Fast algorithm for solving the Hankel/Toeplitz structured total least squares problem. Numerical Algorithms, 23:371-392, 2000.
[Mar63] D. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. SIAM J. Appl. Math., 11:431-441, 1963.
[MD03] I. Markovsky and B. De Moor. Linear dynamic filtering with noisy input and output. In Proceedings of the 13th IFAC Symposium on System Identification, pages 1749-1754, Rotterdam, The Netherlands, 2003.
[MLV00] N. Mastronardi, P. Lemmerling, and S. Van Huffel. Fast structured total least squares algorithm for solving the basic deconvolution problem. SIAM J. Matrix Anal. Appl., 22:533-553, 2000.
[MM98] M. Muhlich and R. Mester. The role of total least squares in motion analysis. In H. Burkhardt, editor, Proceedings of the 5th European Conference on Computer Vision, pages 305-321. Springer-Verlag, 1998.
[MMH03] J. Manton, R. Mahony, and Y. Hua. The geometry of weighted low-rank approximations. IEEE Trans. Signal Process., 51(2):500-514, 2003.
[Moo81] B. Moore. Principal component analysis in linear systems: Controllability, observability and model reduction. IEEE Trans. Automat. Control, 26(1):17-31, 1981.
[MR93] M. Moonen and J. Ramos. A subspace algorithm for balanced state space system identification. IEEE Trans. Automat. Control, 38:1727-1729, 1993.
[MRP+05] I. Markovsky, M.-L. Rastello, A. Premoli, A. Kukush, and S. Van Huffel. The element-wise weighted total least squares problem. Comput. Statist. Data Anal., 50(1):181-209, 2005.
[MV06] I. Markovsky and S. Van Huffel. On weighted structured total least squares. In I. Lirkov, S. Margenov, and J. Wasniewski, editors, Proceedings of the 5th International Conference on "Large-Scale Scientific Computations", volume 3743 of Lecture Notes in Computer Science, pages 695-702. Springer-Verlag, Berlin, 2006.
[MWD02] I. Markovsky, J. C. Willems, and B. De Moor. Continuous-time errors-in-variables filtering. In Proceedings of the 41st Conference on Decision and Control, pages 2576-2581, Las Vegas, NV, 2002.
[MWD05] I. Markovsky, J. C. Willems, and B. De Moor. State representations from finite time series. In Proceedings of the 44th Conference on Decision and Control, pages 832-835, Seville, Spain, 2005.
[MWRM05] I. Markovsky, J. C. Willems, P. Rapisarda, and B. De Moor. Data driven simulation with applications to system identification. In Proceedings of the 16th IFAC World Congress, Prague, Czech Republic, 2005.
[MWV+05] I. Markovsky, J. C. Willems, S. Van Huffel, B. De Moor, and R. Pintelon. Application of structured total least squares for system identification and model reduction. IEEE Trans. Automat. Control, 50(10):1490-1500, 2005.
[Nie01] Y. Nievergelt. Hyperspheres and hyperplanes fitted seamlessly by algebraic constrained total least-squares. Linear Algebra Appl., 331:43-59, 2001.
[Nie02] Y. Nievergelt. A finite algorithm to fit geometrically all midrange lines, circles, planes, spheres, hyperplanes, and hyperspheres. Numer. Math., 91:257-303, 2002.
[NS48] J. Neyman and E. Scott. Consistent estimates based on partially consistent observations. Econometrica, 16(1):1-32, 1948.
[PR02] A. Premoli and M.-L. Rastello. The parametric quadratic form method for solving TLS problems with elementwise weighting. In Van Huffel and Lemmerling [VL02], pages 67-76.
[Pra87] V. Pratt. Direct least-squares fitting of algebraic surfaces. ACM Computer Graphics, 21(4):145-152, 1987.
[PS01] R. Pintelon and J. Schoukens. System Identification: A Frequency Domain Approach. IEEE Press, Piscataway, NJ, 2001.
[PW98] J. Polderman and J. C. Willems. Introduction to Mathematical Systems Theory. Springer-Verlag, New York, 1998.
[RH95] B. Roorda and C. Heij. Global total least squares modeling of multivariate time series. IEEE Trans. Automat. Control, 40(1):50-63, 1995.
[Roo95] B. Roorda. Algorithms for global total least squares modelling of finite multivariable time series. Automatica, 31(3):391-404, 1995.
[RPG96] J. Rosen, H. Park, and J. Glick. Total least norm formulation and solution of structured problems. SIAM J. Matrix Anal. Appl., 17:110-126, 1996.
[RW97] P. Rapisarda and J. C. Willems. State maps for linear systems. SIAM J. Control Optim., 35(3):1053-1091, 1997.
[SKMH05] S. Shklyar, A. Kukush, I. Markovsky, and S. Van Huffel. On the conic section fitting problem. Journal of Multivariate Analysis, 2005.
[SLV04] M. Schuermans, P. Lemmerling, and S. Van Huffel. Structured weighted low rank approximation. Numer. Linear Algebra Appl., 11:609-618, 2004.
[SLV05] M. Schuermans, P. Lemmerling, and S. Van Huffel. Block-row Hankel weighted low rank approximation. Numer. Linear Algebra Appl., to appear, 2005.
[SMWV05] M. Schuermans, I. Markovsky, P. Wentzell, and S. Van Huffel. On the equivalence between total least squares and maximum likelihood PCA. Analytica Chimica Acta, 544:254-267, 2005.
[Sp97] H. Spath. Orthogonal least squares fitting by conic sections. In S. Van Huffel, editor, Recent Advances in Total Least Squares Techniques and Errors-in-Variables Modeling, pages 259-264. SIAM, Philadelphia, 1997.
[TM97] P. Torr and D. Murray. The development and comparison of robust methods for estimating the fundamental matrix. Int. J. Computer Vision, 24(3):271-300, 1997.
[VD92] M. Verhaegen and P. Dewilde. Subspace model identification, Part 1: The output-error state-space model identification class of algorithms. Int. J. Control, 56:1187-1210, 1992.
[VD96] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, Boston, 1996.
[VL02] S. Van Huffel and P. Lemmerling, editors. Total Least Squares and Errors-in-Variables Modeling: Analysis, Algorithms and Applications. Kluwer, 2002.
[VPR96] S. Van Huffel, H. Park, and J. Rosen. Formulation and solution of structured total least norm problems for parameter estimation. IEEE Trans. Signal Process., 44(10):2464-2474, 1996.
[VSV+04] S. Van Huffel, V. Sima, A. Varga, S. Hammarling, and F. Delebecque. High-performance numerical software for control. IEEE Control Systems Magazine, 24:60-76, 2004.
[VV91] S. Van Huffel and J. Vandewalle. The Total Least Squares Problem: Computational Aspects and Analysis. SIAM, Philadelphia, 1991.
[WAH+97] P. Wentzell, D. Andrews, D. Hamilton, K. Faber, and B. Kowalski. Maximum likelihood principal component analysis. J. Chemometrics, 11:339-366, 1997.
[Wil86a] J. C. Willems. From time series to linear system - Part I. Finite dimensional linear time invariant systems. Automatica, 22(5):561-580, 1986.
[Wil86b] J. C. Willems. From time series to linear system - Part II. Exact modelling. Automatica, 22(6):675-694, 1986.
[Wil87] J. C. Willems. From time series to linear system - Part I. Finite dimensional linear time invariant systems, Part II. Exact modelling, Part III. Approximate modelling. Automatica, 22, 23:561-580, 675-694, 87-115, 1986, 1987.
[Wil91] J. C. Willems. Paradigms and puzzles in the theory of dynamical systems. IEEE Trans. Automat. Control, 36(3):259-294, 1991.
[WR02] J. C. Willems and P. Rapisarda. Balanced state representations with polynomial algebra. In A. Rantzer and C. I. Byrnes, editors, Directions in Mathematical Systems Theory and Optimization, chapter 25, pages 345-357. Springer-Verlag, 2002.
[WRMM05] J. C. Willems, P. Rapisarda, I. Markovsky, and B. De Moor. A note on persistency of excitation. Control Lett., 54(4):325-329, 2005.
[Zha97] Z. Zhang. Parameter estimation techniques: A tutorial with application to conic fitting. Image and Vision Computing Journal, 15(1):59-76, 1997.
Index
adjusted least squares, 22, 25-26, 72-77, 80-83, 90-93
algebraic fitting, 5, 85
alternating least squares, 41-42, 161, 164
annihilating behavioral equations, 102
annihilator, 113, 146, 163, 182
approximate
identification, 122, 142, 159-174, 192
left kernel, 122
rank revealing factorization, 132
realization, 132, 160, 168
ARMAX, 2, 7, 117
autonomous, see model, autonomous
axiom of state, 107
backward shift operator, 23
balanced
approximation, 131
error bounds, 143
finite time, 142
representation, 141
truncation, 159
behavior B, 16, 101
dual, 114
full, 106
manifest, 106
behavioral approach, 9, 16
behavioral equations, 102
bilinear model, 21, 71-84
causality, 105
Cayley-Hamilton theorem, 112
chemometrics, 26, 33
Cholesky factor, 61
confidence bounds, 64
consistency, 21, 32, 75, 82, 93, 164
constrained total least squares, 50
controllability
extended matrix, 111
gramian, 132
index, 112
convolution, 109
DAISY, 161, 171-173
data driven simulation, 130
deconvolution, 66
delta function, δ, 58, 131
displacement rank, 50
dual behavior, 114, 164
dynamic programming, 154, 182
eigenvalue decomposition, 22, 93
eight-point algorithm, 78
element-wise product , 30
element-wise product ⊙, 30
element-wise WTLS, 30
elimination theorem, 106
epipolar constraint, 78
equation
error, 1, 18
misfit, 19
bilinear, 72
element-wise weighted, 31
quadratic, 86
state estimation, 151
structured, 58
weighted, 31
exact identication, 115, 120
ltering, 153
frequency domain, 109
fully weighted TLS, 186
fundamental lemma, 120
fundamental matrix, 78
generalized TLS, 20
misfit computation, 39
solution, 37
geometric fitting, 4, 86
global TLS, 24, 161
GNU scientific library, 189
Hankel low-rank approximation, 66
Hankel matrix, 120
identifiability, 119-121
Identification Toolbox of MATLAB, 136, 165
image representation, see representation,
image
impulse response, 109, 125, 136
input cardinality, 105
input/output partitioning, 34, 116, 165
inverse power iteration, 40
Kalman lter, 151
kernel representation, see representation,
kernel
Koopmans-Levins method, 161
Kung, 132, 138, 160, 168, 192
lag, 103
LAPACK, 61
latency, 1, 19, 157, 167
latent variables, 106
least squares, 21, 88-90
left prime, 108
Levenberg-Marquardt algorithm, 64, 188
linear time-invariant systems, 101
low rank approximation
structured, 160
low-rank approximation
Hankel, 66
structured, 50
weighted, 40
manifest variables, 106
Markovparameters, see impulse response
MATLAB, 123, 185
matrix approximation theorem, 36, 81
matrix fraction representation, 123, 165
maximally free variable, 105
maximum likelihood, 20, 31, 49, 58, 152, 159
maximum likelihood principal component analysis, 40
measurement error, 20
minimal representation, 105
MINPACK, 188
mixed LSTLS, 66
model
autonomous, 108, 131, 169
bilinear, 21, 71-84
causal, non-anticipating, 105
complete, 102
complexity, 113
controllable, 108
EIV, see errors-in-variables
exact, unfalsified, 16
linear time-invariant, 101-103
most powerful unfalsified model, 16, 117
observable, 106
quadratic, 22
time-invariant, 102
model class M, 16
model reduction, 142, 170
module, 114
MOESP, 133, 143
motion analysis, 26
N4SID, 135
nonanticipation, 105
nongeneric TLS problem, 36
oblique projection, 135, 146
observability
extended matrix, 112
gramian, 132
index, 112
observability matrix, 124
Optimization Toolbox of MATLAB, 187
orthogonal projection, 133
orthogonal regression, 20, 86, 96
outcome, 16
output cardinality, 105
output error identification, 165
parameter optimization, 18
parameters, 18
partial least squares, 73
persistency of excitation, 120
pointwise convergence, 103
prediction error methods, 165, 172
processing, 105
pseudo-Jacobian, 63
QR factorization, 127, 135
quadratic model, 22
quasi-Newton method, 45, 63
rank
numerical, 117, 131
revealing factorization, 125, 132
realizability, 130
realization theory, 130-132, 168
recursive smoothing, 154
regularization, 69
relative error TLS, 30
representation, 18, 3334
balanced, 141
convolution, 109, 123
image, 33
minimal, 33, 109
input/output, 21, 34, 105-106
input/state/output, 107, 165
isometric, 161, 164
kernel, 19, 33, 103-105, 122
equivalent, 104
minimal, 33, 104
matrix fraction, 123, 165
shortest lag, 104
state space, 106-107
response
free, 124
parameterization, 111
sequential free, 125
Riccati equation, 155
Riemannian SVD, 40, 50
row proper matrix, 104
s-dependence, 59
Schur algorithm, 50
score equation, 73
shift equation, 124, 132, 143
shift operator, 23
shift-and-cut operator, 146
signal space, 101
SLICOT, 61, 189
smoothing, 152, 162
state
axiom, 107
map, 112
variables, 107
state estimation, 153
state space representation, see represen-
tation, state space
stationarity, 59
structure from motion, 78
structured EIV model, 58
structured matrix S(p), 24
structured TLS, 24, 49-69
structured total least norm, 50
structured weighted TLS, 69
system identification, 23
time-invariant, see model, time-invariant
Toeplitz matrix, 112
total least squares, 20
efficient computation, 36
nongeneric, 36
solution, 36
transfer function, 165
unimodular matrix, 104
unit vector e_i, 111
universum U, 16
validation, 187
variable
bound, 105
free, 105
latent, 106, 167
manifest, 106
state, 107
variation of constants formula, 111
weighted least squares, 35
weighted structured TLS, 69
weighted TLS, 20, 29-48
Z-transform, Z, 108