
An Introduction to Multivariate Calibration and Analysis

REPORT

Kenneth R. Beebe
Bruce R. Kowalski
Department of Chemistry, BG-10
Laboratory for Chemometrics
University of Washington
Seattle, Wash. 98195

The advent of the laboratory computer has allowed analytical chemists to collect enormous amounts of data on a wide variety of problems of interest. With this ability, however, has come the realization that more data does not necessarily mean more information. One can collect reams of computer output without knowing anything more about the system under investigation. Only when the data are interpreted and put to use do they become valuable to the chemist and to society in general; then data become information. For this reason, data analysis methods that can handle large quantities of data are becoming a necessity for the practicing analytical chemist. The number of books in this area (1-3) is increasing as the awareness of the need for more sophisticated techniques grows. However, widespread use of the full set of analytical tools will be realized only when the analyst is familiar with the general goals and advantages of the methods.

This REPORT serves as an introduction to the area of multivariate analyses known as multivariate calibration. Our goal is to give the chemist insight into the workings of a collection of statistical techniques and to enable him or her to judge the appropriateness of the techniques for an application of interest. The list of methods described is by no means a complete representation of those that can be applied to multivariate data. Also, we do not compare the results obtained using one method over another; instead, we describe the methods in the line of a tutorial. For this reason, the examples chosen illustrate how the methods work rather than compare among them. The three multivariate methods that will be discussed are multiple linear regression (MLR), principal component regression (PCR), and partial least squares (PLS). MLR, an extension of ordinary least squares, is the easiest to understand and also the most commonly used; PCR and PLS have yet to achieve widespread acceptance in chemistry.

Calibration and prediction

Multivariate statistics is a collection of powerful mathematical tools that can be applied to chemical analysis when more than one measurement is acquired for each sample. For the sake of consistency, near-infrared (near-IR) reflectance analysis will be used as an example of such an analytical method throughout this REPORT. However, the techniques described can be used in any situation where multiple measurements are acquired.

One example of an analytical problem solved by near-IR reflectance analysis is the estimation of the protein and moisture content of wheat samples. As with many analytical methods, this procedure consists of two phases: calibration and prediction. The chemist begins by constructing a data matrix (R) from the instrument responses at selected wavelengths for a given set of calibration samples. In the case of near-IR reflectance analysis, the R matrix can be constructed from the logarithmic reflectances (log(1/ref)) or from some other combination of reflectances obtained at the various wavelengths (4). A matrix of concentration values (C) is then formed using independent or referee methods such as Kjeldahl nitrogen analysis for protein determination and oven-drying for the determination of moisture content. The goal of the calibration phase is to produce a model that relates the near-IR spectra to the values obtained by the independent or referee methods.

Figure 1 illustrates the resulting matrices for this example. R is a 3 X 8 matrix with three rows of near-IR spectra from the analysis of three samples and eight columns corresponding to the eight near-IR wavelengths chosen for the analysis. (Throughout this REPORT, these columns will be referred to as the original variables for R.) The C matrix is 3 X 2 dimensional for this example. Again, the three samples occupy the three rows and the two columns represent the protein and moisture content of the samples as determined by the referee methods. According to this notation, the terms $c_{i1}$ and $c_{i2}$ represent the protein content and moisture content, respectively, for the ith wheat sample found in the ith row of C.

0003-2700/87/A359-1007/$01.50/0 ANALYTICAL CHEMISTRY, VOL. 59, NO. 17, SEPTEMBER 1, 1987 · 1007 A
© 1987 American Chemical Society
Figure 1. Configuration of R and C matrices.

The next step in the calibration phase is the foundation of the entire analysis. The analyst must choose an appropriate mathematical method that will best reproduce C given the R matrix. How the analyst defines best will ultimately determine the method that is chosen.

MLR assumes that the best approach to estimating C from R is to find the linear combination of the variables in R that minimizes the errors in reproducing C. It proposes a relationship between C and R such that C = RS + E, where S is a matrix of regression coefficients and E is a matrix of errors associated with the MLR model. S is estimated by linear regression, and the term

$$
\mathrm{Err} = \sum_{k=1}^{K}\sum_{i=1}^{I} (c_{ik} - \hat{c}_{ik})^{2} = \sum_{k=1}^{K}\sum_{i=1}^{I} e_{ik}^{2} \tag{1}
$$

is minimized. In this expression, $c_{ik}$ is the actual concentration element in the ith row and kth column of C, $\hat{c}_{ik}$ is the MLR estimate of the same element using S, and $e_{ik}$ is the corresponding element of the matrix E.

This approach appears to be reasonable, and in fact it is the best method when the analyst is dealing with well-behaved systems (linear responses, no interfering signals, no analyte-analyte interactions, low noise, and no collinearities). However, problems can be encountered with the use of real data. A model chosen solely according to this criterion attempts to use all of the variance in the R matrix, including any irrelevant information, to model C. When the resulting model is then applied to a new sample, the model will assume that the correlation found between the calibration R and C matrices also exists in that sample. Because the model was built using irrelevant information in the R matrix, this assumption will not be true. Unfortunately, even noise has a very high probability of being used to build the model. The following example illustrates this point. Consider the matrices

$$
\mathbf{R} = \begin{bmatrix} 75 & 152 & 102 \\ 63 & 132 & 82 \\ 96 & 218 & 176 \\ 69 & 157 & 124 \end{bmatrix}, \quad
\mathbf{C} = \begin{bmatrix} 2 & 7 & 5 \\ 4 & 3 & 3 \\ 9 & 12 & 3 \\ 6 & 8 & 2 \end{bmatrix}, \quad
\mathbf{S} = \begin{bmatrix} -0.71 & 0.55 & 0.48 \\ 0.42 & -0.41 & -0.24 \\ -0.08 & 0.28 & 0.05 \end{bmatrix}
$$

MLR was used to determine the S matrix, as described earlier. A standard measure of the effectiveness of the model is the value of Err as presented above. For this example Err = 0.49, and it will be assumed that the matrix S is an accurate estimate of the true model. The coefficients in S, therefore, will closely approximate the true relationship between the variables in R and C.

To illustrate how a calibration method can be inappropriate, a column of random numbers ranging from zero to 100 was added to the R matrix. The addition of this column is analogous to the inclusion of a wavelength in near-IR analysis that has no useful information for describing the protein or moisture content of the samples. The resulting matrix $\mathbf{R}_2$ and MLR model are

$$
\mathbf{R}_2 = \begin{bmatrix} 75 & 152 & 102 & 91 \\ 63 & 132 & 82 & 36 \\ 96 & 218 & 176 & 74 \\ 69 & 157 & 124 & 51 \end{bmatrix}, \quad
\mathbf{C} = \begin{bmatrix} 2 & 7 & 5 \\ 4 & 3 & 3 \\ 9 & 12 & 3 \\ 6 & 8 & 2 \end{bmatrix}, \quad
\mathbf{S}_2 = \begin{bmatrix} 0.71 & 0.18 & 0.42 \\ -0.42 & -0.19 & -0.20 \\ 0.24 & 0.20 & 0.03 \\ -0.12 & 0.03 & 0.01 \end{bmatrix}
$$

The model resulting from the regression of the same C as in the first example onto this new matrix $\mathbf{R}_2$ is given in $\mathbf{S}_2$. For this model, Err = 0.07, where the smaller value implies that this second model is more effective at modeling C. To further evaluate these results, note that when R is multiplied by S, the jth column of R is always multiplied by the jth row of S. The importance of variable j to the model can therefore be determined by examining the jth row of S. The nonzero entries in the fourth row of $\mathbf{S}_2$ reveal that a variable consisting of random numbers (column number four of $\mathbf{R}_2$) was chosen as a significant contributor to the calibration model. In an ideal analysis, this random variable would have been ignored in the model-building phase of the analysis. Furthermore, the inclusion of this column in $\mathbf{R}_2$ has changed the estimated model coefficients so that they no longer represent the true model. The upper 3 X 3 portion of $\mathbf{S}_2$, which represents the model for the first three variables in $\mathbf{R}_2$, is not equal to S, which represents the true model for the same variables. The addition of a column of random numbers has resulted in a model that appears to be better, in that it is more effective at reproducing C, and yet it does not describe the true relationship. This is because MLR uses all of the matrix R to build the model, regardless of whether or not it is relevant in describing the true model. Therefore, an erroneous model can be derived and subsequently used to predict the characteristics (e.g., protein content) of new samples. Thus MLR alone often will generate misleading models with subsequent errors in prediction.

The remainder of this REPORT investigates the advantages of PCR and PLS over MLR. To understand these advantages, it is necessary to understand how the methods work. Graphical representations of many of the concepts have been included so that the reader not well versed in linear algebra will be able to follow the discussions.

Notation

Standard linear algebra notation will be employed throughout this article. Bold upper case letters refer to matrices; plain upper case letters are used for their row and column dimensions. For example, the letter R refers to the matrix of responses and is I X J dimensional. Bold lower case letters signify column vectors such as $\mathbf{r}_2$, which represents the second column of matrix R. By this convention $\mathbf{r}_2^{T}$ represents the second column of R written as a row vector. Plain lower case letters are scalars, representing single elements of vectors ($a_i$), matrices ($r_{ij}$), or other constants such as regression coefficients (b). The transpose of a matrix or vector is represented by a superscript T. (See References 5 and 6 for books on linear algebra.)
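In this notation, the random-column experiment described earlier can be tried numerically. The following is a minimal sketch, assuming NumPy and an ordinary least-squares solver; the helper names `mlr_fit` and `err` are hypothetical and are not taken from the article. The values it prints are computed at full precision, so they need not agree exactly with the rounded figures quoted in the text.

```python
import numpy as np

# Calibration data from the MLR example in the text.
R = np.array([[75, 152, 102],
              [63, 132,  82],
              [96, 218, 176],
              [69, 157, 124]], dtype=float)
C = np.array([[2,  7, 5],
              [4,  3, 3],
              [9, 12, 3],
              [6,  8, 2]], dtype=float)

def mlr_fit(R, C):
    """Least-squares estimate of S in C = R S + E (hypothetical helper)."""
    S, *_ = np.linalg.lstsq(R, C, rcond=None)
    return S

def err(R, C, S):
    """Equation 1: sum of the squared elements of E = C - R S."""
    E = C - R @ S
    return float(np.sum(E ** 2))

S = mlr_fit(R, C)

# Append the column of random numbers from the text to form R2.
random_column = np.array([[91.0], [36.0], [74.0], [51.0]])
R2 = np.hstack([R, random_column])
S2 = mlr_fit(R2, C)

print("S =\n", np.round(S, 2))
print("Err(R)  =", err(R, C, S))
print("S2 =\n", np.round(S2, 2))
print("Err(R2) =", err(R2, C, S2))
```

Adding a column can only lower (or leave unchanged) the least-squares error on the calibration set, which is why a smaller Err alone does not show that the second model is closer to the true relationship.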



Matrix representations. Matrices of low dimensionality (i.e., the numbers of rows and columns are small integer values) can be represented graphically. Consider the following 2 X 3 matrix

R = [...]

R can be represented graphically as R in row space or as R in column space. Row space is the space formed with the rows of R as the axes. It is, therefore, two-dimensional. Because there are three columns, graphing R in row space consists of plotting three points (each point corresponding to a column) in two-dimensional space. The coordinates of the points representing the columns in each dimension are simply the matrix entries in the corresponding row. The resulting plot for this matrix is shown in Figure 2.

Figure 2. The matrix R in row space.

Figure 3. The matrix R in column space.



Analogous to row space, the column space has two points and is three-dimensional. Figure 3 is a two-dimensional representation of this three-dimensional space. This example can be used to understand higher dimensional matrices, so that a 10 X 12 matrix would have 12 points in 10-dimensional row space and 10 points in 12-dimensional column space. Examples in higher dimensions than three can be understood if one refers back to two or three dimensions and considers situations of higher dimensionality as expansions of these simpler cases.

Projections. Another concept that is important to understand is that of projecting either a point or a vector onto a vector or plane. Each of these can be viewed as being the perpendicular shadow of one object onto another. Figure 4 illustrates the result of projecting the vector a onto a plane in three-dimensional space.

Figure 4. The projection of a vector onto a plane.

Factors. The last concept that is very basic to understanding both PCR and PLS is that of a factor. This is because both PCR and PLS are factor-based modeling procedures. For our purposes, a factor is defined as any linear combination of the original variables in R or C. It can be shown that given J factors for an I X J matrix R, one can also represent the variables in R as a linear combination of these same J factors. It is important to determine factors for greater freedom in representing the useful information in the R matrix. For example, the information in an I X J matrix R can be expressed as an I X J matrix R', where the columns of R' are linear combinations of the original columns in R. The advantage of this is that if a particular column in R is not useful (as in the MLR example cited earlier), one can attempt to find a set of factors that gives that column small weights when forming R'.

This is only one of the advantages of a factor-based calibration method over the methods that simply use the raw data. The other possible advantages can be understood by examining one of the factor methods mentioned in the introduction: PCR (3, 7, 8). The following discussion concerns the preliminary step of PCR: principal component analysis (PCA). An intuitive feel for how PCA works can be gained by considering the result of PCA performed on the 5 X 2 matrix R shown below.

$$
\mathbf{R} = \begin{bmatrix} 2 & 4 \\ 1 & 2 \\ 0 & 0 \\ -1 & -2 \\ -2 & -4 \end{bmatrix}
$$

In column space, the matrix is five points in two dimensions as shown in Figure 5.

Figure 5. The matrix R plotted in column space with the first eigenvector.

In real analyses, the columns of R are often mean-centered and scaled. Mean centering simply involves subtracting the average of a column from each of the entries in that column. Scaling, which gives equal weights to each variable, involves dividing each entry by the standard deviation of the column. PCA is then performed on the matrix $\mathbf{R}^{T}\mathbf{R}$ formed from the mean-centered and scaled matrix (this matrix is proportional to the covariance matrix after mean centering alone, and to the correlation matrix after scaling as well). The first eigenvector corresponding to the largest eigenvalue is, by definition, the direction in the space defined by the columns of R that describes the maximum amount of variation or spread in the samples. Figure 5 shows the data and the direction of the first eigenvector where the space defined by R is a plane.

In this example, all of the variation in the data can be described using one eigenvector. The samples all fall on a line in column space, and therefore all of the variation lies in one direction. When all of the variation in the samples cannot be accounted for using only one eigenvector, a second eigenvector can be found that is perpendicular or orthogonal to the first and describes the maximum amount of residual variation (not described by the first eigenvector) in the data set. Figure 6 is the plot of a 30 X 2 matrix and the associated first two eigenvectors. The direction of the first eigenvector describes the maximum amount of variation or spread in the data. In this particular case, the samples in column space happen to fall within a two-dimensional ellipse, and the first eigenvector corresponds to the major axis of the ellipse.
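The PCA of the 5 X 2 matrix R can be checked numerically. The sketch below assumes NumPy and takes the eigenvectors of $\mathbf{R}^{T}\mathbf{R}$ with `numpy.linalg.eigh`; the helper `autoscale` is hypothetical and assumes that scaling means dividing each mean-centered column by its sample standard deviation. It is an illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def autoscale(R):
    """Mean-center each column, then divide by its sample standard deviation."""
    centered = R - R.mean(axis=0)
    return centered / centered.std(axis=0, ddof=1)

# The 5 X 2 matrix from the text; its columns already have zero mean.
R = np.array([[ 2,  4],
              [ 1,  2],
              [ 0,  0],
              [-1, -2],
              [-2, -4]], dtype=float)

# Eigen-decomposition of the symmetric matrix R^T R.
eigenvalues, eigenvectors = np.linalg.eigh(R.T @ R)

# eigh returns ascending eigenvalues; reorder so the largest comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# The sign of an eigenvector is arbitrary; make its first entry positive.
v1 = eigenvectors[:, 0]
v1 = v1 if v1[0] > 0 else -v1

# Projection of each sample (row of R) onto the first eigenvector.
u1 = R @ v1

# What is left of R after removing the part described by v1.
residual = R - np.outer(u1, v1)

# Autoscaled copy of R, as used in real analyses.
Z = autoscale(R)

print("eigenvalues:", np.round(eigenvalues, 6))
print("first eigenvector:", np.round(v1, 3))   # [0.447 0.894]
print("largest residual:", np.abs(residual).max())
```

Because the second eigenvalue is zero and the residual vanishes, a single eigenvector accounts for all of the variation, which is the numerical counterpart of the samples falling on one line in column space.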



To illustrate how eigenvectors can define factors for a matrix R, the results of the eigenvector analysis shown in Figure 5 are presented. The eigenvector in Figure 5 is equal to the vector $\mathbf{v}^{T} = [0.447, 0.894]$. In vector notation, the first factor or principal component can be shown to equal a linear combination of the original variables as follows:

$$
\mathbf{R}\mathbf{v} = \mathbf{u} \tag{2}
$$

where R is the original matrix of responses, v is the first eigenvector of $\mathbf{R}^{T}\mathbf{R}$, and u is the so-called score vector, which is the projection of R onto the first eigenvector. The vector u is a factor for the columns of R because it can be written as a linear combination of the columns. Likewise, when J eigenvectors are calculated and used to form an I X J matrix U of scores, the original variables in R can be expressed as the linear combination of the scores:

$$
\mathbf{R}\mathbf{V}\mathbf{V}^{T} = \mathbf{U}\mathbf{V}^{T} \tag{3}
$$

$$
\mathbf{R} = \mathbf{U}\mathbf{V}^{T}(\mathbf{V}\mathbf{V}^{T})^{-1} \tag{4}
$$

where $\mathbf{V}^{T}(\mathbf{V}\mathbf{V}^{T})^{-1}$, obtained from linear algebra, is called the generalized inverse (GI) of V. In PCA notation, the elements in the GI of V are called the principal component "loadings." These loadings range from -1 to +1 and are the cosines of the angles between the eigenvector and the variable axes. High loadings correspond to high correlations (large coefficients) where the angle between the eigenvector and a variable is very small [cos(0°) = 1]. Small loadings correspond to low correlations (small coefficients) where the eigenvector is orthogonal or nearly orthogonal to the variable [cos(90°) = 0].

In the example given in Figure 6, the second eigenvector corresponds to the minor axis of the ellipse. Because the eigenvectors lie in the same space as the original variables, they can be used as a new set of axes for the matrix R. Each point or sample has a new set of coordinates from its projections onto the new axes, and these axes may be more useful for describing the variation or spread of the samples. The new axes also can be viewed as a rotation of the original axes. In Figure 6 the transformation from the old to the new axes can be achieved by rotating the original axes by approximately 45 degrees.

In this example, the object (ellipse) formed by the sample points in column space was two-dimensional. In more complicated situations the object can be of higher dimensionality than two, and the process of determining eigenvectors continues until all of the variation in the samples is explained. The dimensionality of the space containing the samples equals the number of columns in R; thus the maximum possible dimensionality of the samples equals the number of columns in R. The collection of points, however, often can be represented by fewer dimensions. This occurs when there is collinearity between the variables in the matrix under investigation. The presence of collinearity means that some of the variables in the original matrix are correlated to the other variables and therefore contain redundant information. In this situation, PCA can be used to describe the variation in R using fewer dimensions than the number of columns in R. The eigenvectors can be used to determine a subspace in the space defined by the columns of R that includes all of the original variation. One can visualize this by considering the column space spanned by R as a room and the samples lying on a table in the room. The column space is three-dimensional, but only two dimensions are necessary to describe the position of each of the sample points relative to the others. The two eigenvectors necessary to describe the samples would define a plane that includes the top of the table. This characteristic of the factor-based methods is exploited to yield more informative models, as is discussed in the following sections.

Figure 6. A 30 X 2 matrix plotted in column space with the first two eigenvectors.

Figure 7. The plane formed by the two columns of matrix R plotted in row space.

Multiple linear regression

The first method considered, MLR, does not attempt to model underlying structure or factors that may exist in R or C. The goal of MLR is to find the linear combination of the variables $\mathbf{r}_j$ such that the model's estimates of the values of the c's in the calibration set given by

$$
\hat{\mathbf{c}} = \sum_{j=1}^{J} s_j \mathbf{r}_j
$$

are as close to the known values of C as possible. As stated earlier, the criterion of closeness for MLR is defined as minimizing the sum of the squares of the deviations of the predicted values from the true values (see Equation 1). In statistical language, MLR is the regression of the columns of C onto the space defined by the columns of R. The mathematical procedure for this can be found in most books on multivariate analysis (3).
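For orientation, the standard least-squares result that such books derive can be summarized as follows. This derivation is not spelled out in the article; it is the textbook normal-equations argument, and it assumes that $\mathbf{R}^{T}\mathbf{R}$ is invertible (no exact collinearity among the columns of R).

```latex
\begin{aligned}
\mathrm{Err} &= \sum_{k=1}^{K}\sum_{i=1}^{I} (c_{ik} - \hat{c}_{ik})^{2}
              = \lVert \mathbf{C} - \mathbf{R}\mathbf{S} \rVert_F^{2}, \\
\frac{\partial\,\mathrm{Err}}{\partial \mathbf{S}}
             &= -2\,\mathbf{R}^{T}(\mathbf{C} - \mathbf{R}\mathbf{S}) = \mathbf{0}
              \;\Longrightarrow\;
              \mathbf{R}^{T}\mathbf{R}\,\mathbf{S} = \mathbf{R}^{T}\mathbf{C}, \\
\hat{\mathbf{S}} &= (\mathbf{R}^{T}\mathbf{R})^{-1}\mathbf{R}^{T}\mathbf{C},
\qquad
\hat{\mathbf{C}} = \mathbf{R}\hat{\mathbf{S}}
                 = \mathbf{R}(\mathbf{R}^{T}\mathbf{R})^{-1}\mathbf{R}^{T}\mathbf{C}.
\end{aligned}
```

The last expression is the orthogonal projection of the columns of C onto the space spanned by the columns of R, and the inverse of $\mathbf{R}^{T}\mathbf{R}$ is the step that becomes unstable when the columns of R are nearly collinear.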



Graphically, MLR finds the projection of the columns of the matrix C onto the variables or columns of R in its row space. To illustrate this, consider the following matrix and vector

$$
\mathbf{R} = \begin{bmatrix} 39 & 29 \\ 18 & 15 \\ 11 & 6 \end{bmatrix}, \quad
\mathbf{c} = \begin{bmatrix} 30 \\ 20 \\ 10 \end{bmatrix}
$$

where R might represent a near-IR data set with two columns of absorbance measurements at two wavelengths and three rows corresponding to three spectra from different samples of wheat, and vector c represents the protein content of the three wheat samples from Kjeldahl nitrogen analysis.

Figure 7 is a graph of the column vectors $\mathbf{r}_1$ and $\mathbf{r}_2$ plotted in row space. The two columns of R define a subspace (a tilted plane) within the row space, and regressing c onto R is equivalent to finding the projection of c onto the subspace defined by R. To perform MLR, one could plot the c vector in this same row space and project c onto the plane formed by $\mathbf{r}_1$ and $\mathbf{r}_2$. Figure 8 is an expanded view of the projection of c onto the plane, where proj c is the estimate of c from matrix R. The resulting regression coefficients for the above example are given as

$$
\mathbf{s} = \begin{bmatrix} 0.20 \\ 0.85 \end{bmatrix}
$$

Figure 8. Geometric representation of the regression of c onto R.

From s an estimate of c can be obtained as $\hat{\mathbf{c}} = \operatorname{proj} \mathbf{c} = s_1\mathbf{r}_1 + s_2\mathbf{r}_2$. Or in matrix notation,

$$
\operatorname{proj} \mathbf{c} = \begin{bmatrix} 39 & 29 \\ 18 & 15 \\ 11 & 6 \end{bmatrix}
\begin{bmatrix} 0.20 \\ 0.85 \end{bmatrix} = \begin{bmatrix} 32.5 \\ 16.4 \\ 7.3 \end{bmatrix}
$$

Another point illustrated by this example is that the ability to estimate c depends on how close c lies to the subspace defined by R. In this example, c is near that subspace, and R's estimate of c, proj c, is the vector in the plane defined by $\mathbf{r}_1$ and $\mathbf{r}_2$ that lies closest to c. The model error can be described geometrically as the length of the segment joining the heads of vector c and proj c.

To predict the concentrations of an unknown sample with a given response vector (e.g., $\mathbf{r}_{unk}^{T} = [10 \;\; 5]$), one multiplies the vector $\mathbf{r}_{unk}$ by s as follows:

$$
\begin{bmatrix} 10 & 5 \end{bmatrix} \begin{bmatrix} 0.20 \\ 0.85 \end{bmatrix} = 6.25 = c_{pred}
$$

where $c_{pred}$ is the predicted concentration in the new sample.

The method of MLR is indeed an adequate procedure in ideal situations. However, it is based on the problematic step of inverting a matrix, and it can incorporate significant amounts of irrelevant variance (information) into the model.

Principal component regression

This model-building procedure has two steps. The first step is the determination of the eigenvectors or factors for a matrix R, as was described in the section on factors. This information is used to convert R into a score matrix U by projecting the R matrix onto the space defined by the eigenvectors (see Equation 2). Until now, no attempt has been made to describe the rationale behind the use of eigenvectors as the new axes. We will show that there is a logical interpretation of the significance of the eigenvectors. If the analyst is to redefine the variables using a smaller number of factors, one reasonable approach is to find the factors that best describe the original variables. The first principal component is simply the linear combination of the original variables that points in a direction that is most correlated to all of the columns in row space. In mathematical terms, it is the linear combination that has the smallest squared errors when used to estimate the original variables. It is also the direction in column space that best describes the variation in the samples. To illustrate these characteristics, consider the following matrix R:

R = [...]

Figure 9 is a representation of the columns of R and the first principal component plotted in row space. No other vector in this space will yield a smaller sum of errors when used to estimate the two columns of R. In the general case of an I X J matrix R, the first principal component will be a vector in I-dimensional space that has the smallest sum of squared errors when used to estimate the J column vectors in R.

Figure 9. The matrix R in row space with the first principal component.

The difference between MLR and PCR lies in the re-expression of R as a score matrix U. This matrix results from projecting R onto the eigenvectors V, as RV = U. The score matrix U is composed of the original data points in a new coordinate system described by the eigenvectors. The second step of PCR uses MLR to regress the C matrix onto the score matrix as US = C. The eigenvectors, V, and the matrix of regression coefficients, S, together form the PCR model.

The prediction step uses the V and S matrices derived in the calibration set as follows. Multiply the new sample response vector or matrix $\mathbf{R}_{unk}$ by V to obtain $\mathbf{U}_{unk}$, then multiply $\mathbf{U}_{unk}$ by S to yield $\mathbf{C}_{unk}$, where $\mathbf{C}_{unk}$ contains the estimates of the concentrations of the analytes in the unknown samples.

If for an I X J matrix R all of the J eigenvectors are used to form V, PCR will yield results identical to MLR. The advantages of PCR are based on the concentration of the information in R into fewer factors than J. Because the eigenvectors optimally describe the maximum variation in the original variables, they can be used to determine the rotation that best describes the information in the matrix R using the minimum number of dimensions. The first factor is the most descriptive; each successive factor describes less of the information contained in R. Deleting some of the later factors that contain mostly noise therefore does not significantly reduce the amount of useful information present. This reduction of dimensionality is often of utmost importance for practical and mathematical stability. Another important characteristic of the matrix U is that if the eigenvectors are calculated from a symmetric matrix such as the correlation or covariance matrix, each of the columns in U will be mutually orthogonal.
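The two PCR steps, and the statement that PCR with all J eigenvectors matches MLR, can be illustrated with the 3 X 2 matrix R and vector c of the MLR example. This is a minimal sketch assuming NumPy; `pcr_fit` and `pcr_predict` are hypothetical helper names, and no mean centering is applied so that the numbers stay comparable with the MLR example.

```python
import numpy as np

# Data from the MLR example: two wavelengths, three wheat samples.
R = np.array([[39, 29],
              [18, 15],
              [11,  6]], dtype=float)
c = np.array([30, 20, 10], dtype=float)
r_unk = np.array([10, 5], dtype=float)

def pcr_fit(R, c, n_factors):
    """Two-step PCR: eigenvectors of R^T R, then MLR of c on the scores."""
    eigenvalues, V = np.linalg.eigh(R.T @ R)
    order = np.argsort(eigenvalues)[::-1]      # largest eigenvalue first
    V = V[:, order[:n_factors]]                # keep the leading eigenvectors
    U = R @ V                                  # score matrix, U = R V
    s, *_ = np.linalg.lstsq(U, c, rcond=None)  # regress c on the scores
    return V, s, U

def pcr_predict(r_new, V, s):
    """Project new responses onto V, then apply the score coefficients."""
    return (r_new @ V) @ s

# Plain MLR for comparison.
s_mlr, *_ = np.linalg.lstsq(R, c, rcond=None)
c_mlr = r_unk @ s_mlr

# PCR with all J = 2 eigenvectors, and with only the first.
V2, s2, U2 = pcr_fit(R, c, n_factors=2)
V1, s1, U1 = pcr_fit(R, c, n_factors=1)
c_pcr_full = pcr_predict(r_unk, V2, s2)
c_pcr_one = pcr_predict(r_unk, V1, s1)

print("MLR coefficients:", np.round(s_mlr, 2))
print("MLR prediction:", round(float(c_mlr), 2))
print("PCR prediction, 2 factors:", round(float(c_pcr_full), 2))
print("PCR prediction, 1 factor:", round(float(c_pcr_one), 2))
print("U^T U:\n", np.round(U2.T @ U2, 6))
```

The off-diagonal entries of `U2.T @ U2` are zero to within rounding, which reflects the mutual orthogonality of the score columns, and the one-factor prediction shows how the estimate changes once the second eigenvector is discarded.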



One way of viewing this procedure is in terms of the space defined by R. MLR uses all of the space described by the columns of R, whereas PCA determines a subspace by possibly ignoring some of the eigenvectors. When one of the J columns in R is a linear combination of the other J-1 columns, R can be represented as a matrix U with J-1 columns without a loss of information. This new matrix U defines a subspace of R. The first example in this article illustrated how a noisy variable can adversely affect the true model relating R to C. It would be advantageous to ignore this variable by finding a portion of the space (a subspace) defined by R that would be more effective at modeling C.

The advantages of PCR make it the obvious choice over MLR in many situations, but there is room for improvement. At the crucial factor-building step, PCR ignores the information contained in the C matrix. The C matrix is not used until the second step of the procedure, when the factors in R have already been determined. The correlation between these factors and C is not taken into account when the factor model is built. The rotation that is determined has very good descriptive qualities concerning the R matrix, but there is not much reason to assume it will optimize the modeling of C. The ultimate goal is to model the information in the C matrix, hence it seems reasonable that C should be used to aid in determining the factors for R.

Partial least squares

The method of PLS (9, 10) is a modeling procedure that simultaneously estimates underlying factors in both R and C. These factors are used to define a subspace in R that is better able to model C. With PCR, the rotation defined by the eigenvectors was used to find a subspace in R that subsequently was used to model C. The approach taken by PLS is very similar to that of PCA, except that factors are chosen to describe the variables in C as well as in R. This is accomplished by using the columns of the C matrix to estimate the factors for R. At the same time, the columns of R are used to estimate the factors for C. The resulting models are

$$
\mathbf{R} = \mathbf{T}\mathbf{P} + \mathbf{E} \tag{5}
$$

and

$$
\mathbf{C} = \mathbf{U}\mathbf{Q} + \mathbf{F} \tag{6}
$$

where the elements of T and U are called the scores of R and C, respectively, and the elements of P and Q are called the loadings. The matrices E and F are the errors associated with modeling R and C with the PLS model.

The T factors are not optimal for estimating the columns of R as was the case with PCA, but are rotated so as to simultaneously describe the C matrix. In the ideal situation, the sources of variation in R are exactly equal to the sources of variation in C, and the factors for R and C are identical. In real applications, R varies in ways not correlated to the variation in C, and therefore $\mathbf{t} \neq \mathbf{u}$. However, when both matrices are used to estimate factors, the factors for the R and C matrices have the following relationship

$$
\mathbf{u} = b\,\mathbf{t} + \mathbf{e} \tag{7}
$$

where the b is termed the inner relationship between u and t and is used to calculate subsequent factors if the intrinsic dimensionality of R is greater than one.

Geometrically, this states that the vectors u and t are approximately equal except for their lengths. To get a feel for where these factors lie in relationship to the original R and C matrices, one can try to picture both of these matrices and their factors plotted in row space. R will have J vectors, one for each column, whereas C will have K vectors corresponding to its columns. The factor t will lie close to the J vectors for R and will be a good estimator for these vectors. Likewise, u will lie close to the K vectors for C. If there is a good underlying model that relates R to C, this will be evident by the similarity of the vectors u and t in this space, and e in Equation 7 will have a small value. If the first principal component were plotted in this same space, one would find that the t vector would differ from it in that t would be rotated toward the C vectors.

In a similar manner, one can attempt to view PLS working in the column spaces of R and C. Figure 10 represents a hypothetical example of a 6 X 3 response matrix and corresponding 6 X 3 concentration matrix plotted in their column spaces.

Figure 10. Geometric representation of the workings of PLS.

ANALYTICAL CHEMISTRY, VOL. 59, NO. 17, SEPTEMBER 1, 1987 · 1015 A


column spaces. The directions of the first PLS factors (t and u) would explain much of the variation of the samples in the respective spaces. If the PLS model is valid, plotting t versus u will also yield a linear relationship. In fact, PLS will compromise the ability of the factors to describe the samples in the individual spaces (Figure 10, top) to increase the correlation of t to u (Figure 10, bottom). It is this compromise that allows for the determination of a factor model that simultaneously describes R and C.

For an example of the PLS algorithm, the same R and C matrices used in the MLR example will be employed. The data were mean-centered before the analysis, and the following two factors were determined.

$$
t = \begin{bmatrix} 20.5 \\ -4.7 \\ -15.8 \end{bmatrix}, \qquad
u = \begin{bmatrix} 10 \\ 0 \\ -10 \end{bmatrix}
$$

Plotting t versus u (Figure 11) reveals the nearly linear relationship that exists between them. The calculated value of b as defined in Equation 7 is 0.53.

Figure 11. The R matrix factor (t) plotted versus the C matrix factor (u).

If the intrinsic dimensionality of R is greater than one, that is, more than one factor is necessary to describe the variation, additional factors are determined. The relationship given in Equation 7 is then used to describe the model for C. Instead of using C = UQ + F, one can substitute the equality stated in Equation 7 to yield C = bTQ + G. In other words, one can estimate the scores of C from the scores of R.

An analysis of data using PLS can then be summarized as the determination of factors in R and C using all of the information available. The final model consists of score matrices T and U (called scores for R and C) that are linearly related with a coefficient, $b_l$, describing the relationship for each of the L factors.

To perform prediction on an unknown sample, the model used to derive the scores of R and C and the relationship $b_l$ are used in the following manner. The vector of responses $r_{unk}$ for the unknown sample is taken through the calibration model (P), and a score vector ($t_{unk}$) is calculated. Using $b_l$ and Equation 7, the score $t_{unk}$ yields an estimate of the scores for the predicted matrix of concentrations ($u_{unk}$). The vector $u_{unk}$ is then transformed into concentration estimates using the calibration model for the C matrix (Q).

For more information concerning the mechanics of calibration and prediction using the PLS algorithm, see Reference 11. For an even more rigorous mathematical treatment of PLS in the context of singular value decomposition and the power method, see Reference 12. The main point of this section is that PLS estimates factors for R and C using all of the information available. In PLS these factors u and t are called latent variables and are similar to the principal components in PCR.

Conclusion
It is instructive to summarize briefly the mechanics of each of the methods
discussed in this REPORT. MLR models C from R using a least-squares criterion. All of R is available to model C, and the method is dependent on the inversion of a matrix. PCR first models R by a score matrix U and then uses MLR to estimate the relationship between U and C. The possibility of deleting factors and thus reducing dimensionality was discussed as its major advantage. Finally, PLS builds factors for R and C using both matrices. The advantage of PLS over PCR is that it incorporates more information in the model-building phase.
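
The mechanics summarized above can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names, the random data matrix, and the choice of the singular value decomposition to obtain the PCR factors are all assumptions made for illustration, and Equation 7 is assumed to have the form u = bt + e with no intercept (the data are mean-centered). Only the t and u values are taken from the PLS example in the text.

```python
import numpy as np

def mlr_fit(R, C):
    """MLR: least-squares coefficients B such that C is approximated by R @ B."""
    B, *_ = np.linalg.lstsq(R, C, rcond=None)
    return B

def pcr_fit(R, C, n_factors):
    """PCR: model R by its first n_factors scores, then regress C on the scores."""
    # Rows of Wt are the eigenvectors of R.T @ R, ordered by decreasing variance.
    _, _, Wt = np.linalg.svd(R, full_matrices=False)
    W = Wt[:n_factors].T          # retained eigenvectors (J x k)
    U_scores = R @ W              # score matrix U (samples x k)
    Q = mlr_fit(U_scores, C)      # MLR step relating the scores to C
    return W @ Q                  # coefficients expressed in the original variables

def inner_coefficient(t, u):
    """Least-squares slope b in the assumed relation u = b*t + e."""
    return float(t @ u / (t @ t))

# Scores quoted in the PLS example (rounded as printed).
t = np.array([20.5, -4.7, -15.8])
u = np.array([10.0, 0.0, -10.0])
b = inner_coefficient(t, u)

# Hypothetical mean-centered calibration data, for illustration only.
rng = np.random.default_rng(0)
R = rng.normal(size=(8, 3))
R -= R.mean(axis=0)
C = R @ np.array([[1.0], [0.5], [-0.2]]) + 0.01 * rng.normal(size=(8, 1))
C -= C.mean(axis=0)

B_mlr = mlr_fit(R, C)
B_pcr_full = pcr_fit(R, C, n_factors=3)  # all factors kept
B_pcr_2 = pcr_fit(R, C, n_factors=2)     # one factor deleted
```

With the rounded scores, `b` comes out near 0.52, close to the reported 0.53; the small difference may stem from rounding of the printed values. When every factor is kept, `B_pcr_full` agrees with `B_mlr`, which reflects the statement that MLR uses all of the space described by the columns of R; deleting a factor (`B_pcr_2`) restricts the model to a subspace.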
The advantages gained by employing the factor-based methods are realized because of the characteristics of the factors. The first advantage of factor models comes from a computational limitation of MLR. MLR cannot be used in cases where the columns or measurement variables in R are linear combinations of each other. This is because MLR is based on the inversion of the matrix $R^{T}R$, which is not possible if the matrix R has columns that are exact linear combinations of the remaining columns. Furthermore, the inversion process is unstable when some columns are nearly linear combinations of others. A model built on such an inverse will give large errors in concentration estimates when noise is present in the R matrix. Much work has been done in statistics in the area of variable selection to eliminate or adjust for collinearities among the variables of R
(e.g., stepwise regression, Mallows' statistics [3]). These nonfactor-based methods are disadvantageous in that they are based on the ultimate deletion of variables, and in doing so they may eliminate useful information. Factor-based models can eliminate collinearities without removing useful information.

The second main advantage of factor-based models is the possibility of removing noise from the data matrix. Again, using PCA as an example one can demonstrate that the inclusion of all of the eigenvectors can actually decrease the predictive ability of the model. To demonstrate when this is occurring, the analyst can use cross validation (13) and build a factor-based calibration model using a set of samples termed the training set. This calibration model is then applied to the R matrix of an independent set of samples (termed the test set), and an estimate of the C matrix for the test set is obtained. (For the test set, the C matrix is also known.) The sum of squared deviations between the predicted and the true values of the $c_{ij}$ is calculated for the test set. This value is termed the PRESS value (predictive residual error sum of squares). Factors are then added one at a time, and the PRESS value is calculated. What usually occurs is that the PRESS value levels off or reaches a minimum before all of the factors have been included. Inclusion of additional factors beyond this point actually results in an increased PRESS value, which means poorer prediction. This is because the random variation or noise in the R matrix is used to fit the C matrix (this is termed overfitting). By not including the factors that describe mainly noise, it is possible to increase the predictive ability of the model. Both PCR and PLS offer the capability of such noise reduction.

Data contain relevant information as well as noise and irrelevant information. Chemometricians and statisticians are working to develop the best possible tools that can extract the full complement of useful information from all of the data chemists acquire in their investigations.

References
(1) Sharaf, M. A.; Illman, D. L.; Kowalski, B. R. Chemometrics; Wiley: New York, 1986.
(2) Malinowski, E. R.; Howery, D. G. Factor Analysis in Chemistry; Wiley: New York, 1980.
(3) Draper, N.; Smith, H. Applied Regression Analysis, 2nd ed.; Wiley: New York, 1981.
(4) Wetzel, D. L. Anal. Chem. 1983, 55, 1165A.
(5) Strang, G. Linear Algebra and Its Applications; Academic Press: New York, 1980.
(6) Wesson, J. R. Lessons in Linear Algebra; Charles E. Merrill Publishing Co.: Columbus, 1974.
(7) Rummel, R. J. Applied Factor Analysis; Northwestern University Press: Evanston, 1970.
(8) Jackson, J. E. Journal of Quality Technology, 1980, 12, 201-13.
(9) Wold, H. In Research Papers in Statistics; David, F., Ed.; Wiley: New York, 1966, pp. 411-44.
(10) Wold, H. In Systems Under Indirect Observation, vol. 2; Joreskog, H.; Wold, H., Eds.; North-Holland: Amsterdam, 1982, pp. 1-54.
(11) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta, 1986, 185, 1-17.
(12) Lorber, A.; Wangen, L.; Kowalski, B. R. Journal of Chemometrics, 1986, 1, 1.
(13) Wold, S. Technometrics, 1978, 20, 397.

Bruce R. Kowalski received his Ph.D. in chemistry from the University of Washington in 1969. In 1972, he joined the chemistry faculty at Colorado State University and then returned to the University of Washington, where he is a member of the chemistry faculty and codirector of the Center for Process Analytical Chemistry. His research interests in analytical chemometrics include developing new methods of multivariate analysis for the resolution and calibration steps in analytical instrumentation, extending various chromatography-spectrometry methods, and using chemical sensors in process analysis and control.

Kenneth R. Beebe is a research assistant at the University of Washington. He received his B.S. in chemistry in 1981, and his M.S. in 1985 from the University of Washington. Currently he is enrolled in the Ph.D. program in the analytical division of the chemistry department. His research interest is in the application of various chemometrics techniques to optimize multivariate calibration procedures.