
Properties of the Trace and Matrix Derivatives

John Duchi

Contents
1 Notation
2 Matrix multiplication
3 Gradient of linear function
4 Derivative in a trace
5 Derivative of product in trace
6 Derivative of function of a matrix
7 Derivative of linear transformed input to function
8 Funky trace derivative
9 Symmetric Matrices and Eigenvectors

Notation

A few things on notation (which may not be very consistent, actually): the columns of a matrix $A \in \mathbb{R}^{m \times n}$ are $a_1$ through $a_n$, while the rows are given (as vectors) by $\tilde{a}_1^T$ through $\tilde{a}_m^T$.

Matrix multiplication
First, consider a matrix $A \in \mathbb{R}^{n \times n}$. We have that

$$AA^T = \sum_{i=1}^n a_i a_i^T,$$

that is, the product $AA^T$ is the sum of the outer products of the columns of $A$. To see this, consider that

$$(AA^T)_{ij} = \sum_{p=1}^n a_{ip} a_{jp}$$

because the $i,j$ element is the $i$th row of $A$, which is the vector $(a_{i1}, a_{i2}, \ldots, a_{in})$, dotted with the $j$th column of $A^T$, which is $(a_{j1}, \ldots, a_{jn})$.

If we look at the matrix $AA^T$, we see that

$$AA^T = \begin{bmatrix} \sum_{p=1}^n a_{1p} a_{1p} & \cdots & \sum_{p=1}^n a_{1p} a_{np} \\ \vdots & \ddots & \vdots \\ \sum_{p=1}^n a_{np} a_{1p} & \cdots & \sum_{p=1}^n a_{np} a_{np} \end{bmatrix} = \sum_{i=1}^n \begin{bmatrix} a_{1i} a_{1i} & \cdots & a_{1i} a_{ni} \\ \vdots & \ddots & \vdots \\ a_{ni} a_{1i} & \cdots & a_{ni} a_{ni} \end{bmatrix} = \sum_{i=1}^n a_i a_i^T.$$
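As a quick numerical sanity check, here is a minimal numpy sketch (the $4 \times 4$ size and the seed are arbitrary choices) verifying that $AA^T$ equals the sum of the outer products of the columns of $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# Sum of outer products of the columns of A.
outer_sum = sum(np.outer(A[:, i], A[:, i]) for i in range(A.shape[1]))

# This should match A A^T up to floating-point error.
assert np.allclose(A @ A.T, outer_sum)
```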

Gradient of linear function

Consider $Ax$, where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. We have

$$\nabla_x Ax = \nabla_x \begin{bmatrix} \tilde{a}_1^T x \\ \tilde{a}_2^T x \\ \vdots \\ \tilde{a}_m^T x \end{bmatrix} = \begin{bmatrix} \nabla_x \tilde{a}_1^T x & \nabla_x \tilde{a}_2^T x & \cdots & \nabla_x \tilde{a}_m^T x \end{bmatrix} = \begin{bmatrix} \tilde{a}_1 & \tilde{a}_2 & \cdots & \tilde{a}_m \end{bmatrix} = A^T.$$

Now let us consider $x^T Ax$ for $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$. We have that

$$x^T Ax = x^T \begin{bmatrix} \tilde{a}_1^T x \\ \vdots \\ \tilde{a}_n^T x \end{bmatrix} = x_1 \tilde{a}_1^T x + x_2 \tilde{a}_2^T x + \cdots + x_n \tilde{a}_n^T x.$$

If we take the derivative with respect to one of the $x_l$'s, we get the $l$th component of each $\tilde{a}_i$, which is to say $a_{il}$, plus the extra term from $x_l \tilde{a}_l^T x$, which gives us

$$\frac{\partial}{\partial x_l} x^T Ax = \sum_{i=1}^n x_i a_{il} + \tilde{a}_l^T x = a_l^T x + \tilde{a}_l^T x.$$

In the end, we see that

$$\nabla_x x^T Ax = Ax + A^T x.$$
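One can check this identity numerically; here is a minimal finite-difference sketch in numpy (the dimension, seed, and step size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

f = lambda v: v @ A @ v  # the quadratic form x^T A x

# Central differences approximate each component of the gradient.
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

assert np.allclose(grad_fd, A @ x + A.T @ x, atol=1e-4)
```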

Derivative in a trace

Recall (as in Old and New Matrix Algebra Useful for Statistics) that we can define the differential of a function $f(x)$ to be the part of $f(x + dx) - f(x)$ that is linear in $dx$, i.e. a constant times $dx$. Then, for example, for a vector-valued function $f$, we can write

$$f(x + dx) = f(x) + f'(x)\, dx + (\text{higher order terms}).$$

In the above, $f'$ is the derivative (or Jacobian). Note that the gradient is the transpose of the Jacobian.

Consider an arbitrary matrix $A$, and write $dx_i$ for the $i$th column of $dX$. We see that

$$\frac{\operatorname{tr}(A\, dX)}{dX} = \frac{\operatorname{tr} \begin{bmatrix} \tilde{a}_1^T dx_1 & & \\ & \ddots & \\ & & \tilde{a}_n^T dx_n \end{bmatrix}}{dX} = \frac{\sum_{i=1}^n \tilde{a}_i^T dx_i}{dX}.$$

Since $\sum_{i=1}^n \tilde{a}_i^T dx_i = \sum_{i,j} a_{ij}\, dx_{ji}$, we have

$$\left[\frac{\operatorname{tr}(A\, dX)}{dX}\right]_{ij} = a_{ij}, \quad \text{so that} \quad \frac{\operatorname{tr}(A\, dX)}{dX} = A.$$

Note that this is the Jacobian formulation.
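The same kind of finite-difference check works here; a sketch (numpy; the probed entry $(j, i)$ is an arbitrary choice) confirming that perturbing $x_{ji}$ moves $\operatorname{tr}(AX)$ at rate $a_{ij}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
eps = 1e-6

# Perturb entry (j, i) of X; tr(AX) should move by roughly a_{ij} * eps.
i, j = 1, 3
E = np.zeros((n, n))
E[j, i] = eps
d_trace = np.trace(A @ (X + E)) - np.trace(A @ X)

assert np.isclose(d_trace / eps, A[i, j], atol=1e-6)
```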

Derivative of product in trace


In this section, we prove that

$$\nabla_A \operatorname{tr} AB = B^T.$$

To see this, write out the product using the rows $\tilde{a}_1^T, \ldots, \tilde{a}_n^T$ of $A \in \mathbb{R}^{n \times m}$ and the columns $b_1, \ldots, b_n$ of $B \in \mathbb{R}^{m \times n}$:

$$\operatorname{tr} AB = \operatorname{tr} \begin{bmatrix} \tilde{a}_1^T \\ \tilde{a}_2^T \\ \vdots \\ \tilde{a}_n^T \end{bmatrix} \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix} = \operatorname{tr} \begin{bmatrix} \tilde{a}_1^T b_1 & \tilde{a}_1^T b_2 & \cdots & \tilde{a}_1^T b_n \\ \tilde{a}_2^T b_1 & \tilde{a}_2^T b_2 & \cdots & \tilde{a}_2^T b_n \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{a}_n^T b_1 & \tilde{a}_n^T b_2 & \cdots & \tilde{a}_n^T b_n \end{bmatrix}$$

$$= \sum_{i=1}^m a_{1i} b_{i1} + \sum_{i=1}^m a_{2i} b_{i2} + \cdots + \sum_{i=1}^m a_{ni} b_{in}.$$

Thus

$$\frac{\partial \operatorname{tr} AB}{\partial a_{ij}} = b_{ji}, \quad \text{so that} \quad \nabla_A \operatorname{tr} AB = B^T.$$
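A finite-difference sketch (numpy; the shapes are arbitrary) of $\nabla_A \operatorname{tr} AB = B^T$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 4
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, n))
eps = 1e-6

# Build grad_A tr(AB) entry by entry with finite differences.
grad = np.zeros_like(A)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad[i, j] = (np.trace((A + E) @ B) - np.trace(A @ B)) / eps

assert np.allclose(grad, B.T, atol=1e-4)
```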

Derivative of function of a matrix


Here we prove that

$$\nabla_{A^T} f(A) = (\nabla_A f(A))^T.$$

Writing the gradient out entry by entry, the $i,j$ element of $\nabla_{A^T} f(A)$ is $\partial f(A)/\partial A_{ji}$, so

$$\nabla_{A^T} f(A) = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{21}} & \cdots & \frac{\partial f(A)}{\partial A_{n1}} \\ \frac{\partial f(A)}{\partial A_{12}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{n2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{1n}} & \frac{\partial f(A)}{\partial A_{2n}} & \cdots & \frac{\partial f(A)}{\partial A_{nn}} \end{bmatrix} = (\nabla_A f(A))^T.$$
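For a concrete instance, take $f(A) = \operatorname{tr} AB$ from the previous section; a sketch (numpy) checking that differentiating with respect to $A^T$ gives the transpose of the earlier gradient $B^T$, i.e. $B$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))  # C plays the role of A^T
eps = 1e-6

# f(A) = tr(AB), viewed as a function of C = A^T, i.e. A = C^T.
f = lambda C: np.trace(C.T @ B)

grad_C = np.zeros_like(C)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(C)
        E[i, j] = eps
        grad_C[i, j] = (f(C + E) - f(C)) / eps

# grad_A tr(AB) = B^T, so grad_{A^T} tr(AB) = (B^T)^T = B.
assert np.allclose(grad_C, B, atol=1e-4)
```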

Derivative of linear transformed input to function

Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. Suppose we have a matrix $A \in \mathbb{R}^{n \times m}$ and a vector $x \in \mathbb{R}^m$. We wish to compute $\nabla_x f(Ax)$. By the chain rule, we have

$$\frac{\partial f(Ax)}{\partial x_i} = \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \frac{\partial (Ax)_k}{\partial x_i} = \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \frac{\partial (\tilde{a}_k^T x)}{\partial x_i} = \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k}\, a_{ki} = \sum_{k=1}^n \nabla_k f(Ax)\, a_{ki} = a_i^T \nabla f(Ax).$$

As such, $\nabla_x f(Ax) = A^T \nabla f(Ax)$. Now, if we would like the second derivative of this function (third derivatives would be nice, but I do not like tensors), we have

$$\frac{\partial^2 f(Ax)}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_j}\, a_i^T \nabla f(Ax) = \frac{\partial}{\partial x_j} \sum_{k=1}^n a_{ki} \frac{\partial f(Ax)}{\partial (Ax)_k} = \sum_{l=1}^n \sum_{k=1}^n a_{ki}\, a_{lj} \frac{\partial^2 f(Ax)}{\partial (Ax)_k \partial (Ax)_l} = a_i^T \nabla^2 f(Ax)\, a_j.$$

From this, it is easy to see that $\nabla_x^2 f(Ax) = A^T \nabla^2 f(Ax) A$.
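A sketch verifying both identities numerically, with the concrete (arbitrary) choice $f(y) = \sum_k y_k^3$, so that $\nabla f(y) = 3y^2$ elementwise and $\nabla^2 f(y) = \operatorname{diag}(6y)$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 3
A = rng.standard_normal((n, m))
x = rng.standard_normal(m)
eps = 1e-4

# Concrete choice of f: f(y) = sum(y_k^3), grad f(y) = 3 y^2,
# Hessian of f = diag(6 y).
g = lambda v: np.sum((A @ v) ** 3)

# Gradient: A^T grad f(Ax).
grad_fd = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                    for e in np.eye(m)])
assert np.allclose(grad_fd, A.T @ (3 * (A @ x) ** 2), atol=1e-4)

# Hessian: A^T grad^2 f(Ax) A = A^T diag(6 Ax) A.
hess = A.T @ np.diag(6 * (A @ x)) @ A
hess_fd = np.array([[(g(x + eps * (ei + ej)) - g(x + eps * ei)
                      - g(x + eps * ej) + g(x)) / eps ** 2
                     for ej in np.eye(m)]
                    for ei in np.eye(m)])
assert np.allclose(hess_fd, hess, atol=1e-2)
```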

Funky trace derivative


A trABAT C = CAB + C T AB T .

In this section, we prove that

In this bit, let us have AB = f (A), where f is matrix-valued. A trABAT C = A trf (A)AT C = trf ()AT C + trf (A) T C = (AT C )T f () + (T trf (A) T C )T = C T AB T + (T tr T Cf (A))T = C T AB T + ((Cf (A))T )T = C T AB T + CAB
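Since this derivation is easy to get wrong, here is a finite-difference sketch (numpy, arbitrary shapes) of $\nabla_A \operatorname{tr} ABA^T C = CAB + C^T AB^T$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 3, 4
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, n))
eps = 1e-6

f = lambda M: np.trace(M @ B @ M.T @ C)

# Central differences over each entry of A.
grad = np.zeros_like(A)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

assert np.allclose(grad, C @ A @ B + C.T @ A @ B.T, atol=1e-4)
```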

Symmetric Matrices and Eigenvectors

In this section we prove that for a symmetric matrix $A \in \mathbb{R}^{n \times n}$, all the eigenvalues are real, and that the eigenvectors of $A$ form an orthonormal basis of $\mathbb{R}^n$.

First, we prove that the eigenvalues are real. Suppose one, $\lambda$, were complex, with eigenvector $x$: since $A$ is real, conjugating $Ax = \lambda x$ gives $A\bar{x} = \bar{\lambda}\bar{x}$, so

$$\bar{\lambda}\, \bar{x}^T x = (A\bar{x})^T x = \bar{x}^T A^T x = \bar{x}^T A x = \lambda\, \bar{x}^T x.$$

Since $\bar{x}^T x \neq 0$ for $x \neq 0$, this forces $\bar{\lambda} = \lambda$. Thus, all the eigenvalues are real.

Now, suppose we have at least one eigenvector $v \neq 0$ of $A$ with eigenvalue $\lambda$. Consider the space $W$ of vectors orthogonal to $v$. We then have that, for $w \in W$,

$$(Aw)^T v = w^T A^T v = w^T A v = \lambda\, w^T v = 0.$$

Thus, the vectors in $W$, when transformed by $A$, remain orthogonal to $v$, so if we have an original eigenvector $v$ of $A$, then a simple inductive argument (applying the same reasoning to $A$ restricted to $W$) shows that there is an orthonormal set of eigenvectors.

To see that there is at least one eigenvector, consider the characteristic polynomial of $A$, $\chi_A(\lambda) = \det(A - \lambda I)$. The field $\mathbb{C}$ is algebraically closed, so there is at least one complex root $r$; hence $A - rI$ is singular and there is a vector $v \neq 0$ that is an eigenvector of $A$. By the argument above, $r$ is a real eigenvalue, so we have the base case for our induction, and the proof is complete.
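Numerically, this is exactly what numpy's eigendecomposition for symmetric matrices delivers; a small sketch (random $5 \times 5$ example) checking real eigenvalues, orthonormal eigenvectors, and the resulting decomposition:

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2  # symmetrize

# eigh handles symmetric (Hermitian) matrices, returning real eigenvalues
# and an orthonormal set of eigenvectors as the columns of Q.
eigvals, Q = np.linalg.eigh(A)

assert np.all(np.isreal(eigvals))                   # eigenvalues are real
assert np.allclose(Q.T @ Q, np.eye(5))              # eigenvectors orthonormal
assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)   # A = Q diag(lambda) Q^T
```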
